December 2019
tl;dr: First real-time instance segmentation, by linearly combining activation maps and crop with bbox.
Overall impression
This is a well written paper with great idea, and really solid engineering work.
Most previous works (Mask RCNN) depends on feature localization (feature repooling) and repool the features to predict a fixed-resolution (14x14 or 28x28) mask. This is inherently sequential and hard to speed up.
Key ideas
- YOLACT breaks instance segmentation into two parallel tasks:
- 1) Generating a set of prototype masks –> using FCN which are good at producing spatially coherent masks
- 2) Predicting per-instance mask coefficients –> using fc to producing semantic vectors
- The assembly step is a simple linear combination realized by matrix multiplication
- The prototype masks are independent of categories. It learns a distributed representation and each instance mask is a linear combination of the prototypes. Prototype masks are image dependent and do not depend on any specific instance.
- The emerging behavior of prototypes is interesting! (Fig. 5): some is position sensitive, and some detect contours.
- Advantages to Mask RCNN
- Fast. The entire mask branch takes only ~5 ms!
- Temporally stable. Single stage. Segment whole image then crop.
- Better quality for larger object. No fixed size mask prediction
- Disadvantages to Mask RCNN
- Worse overall performance. –> mainly in detection quality. In high threshold, it is even better than Mask RCNN.
- YOLACT may leads to leakage, if bbox is not accurate.
Technical details
- Predict prototype masks in 1/8 scale P3 (finest scale in FPN) and upscale to 1/4.
- Use ReLU for unbounded activations.
- Predict c+4+k coefficients. tanh after k coefficients to enable subtraction.
- Mask assembly: $M = \sigma(PC^T)$, P is hxwxk and C is nxk. n is number of instances. The masks are probabilistic, thus using a sigmoid.
- ResNet is not exactly translation variant because of padding.
- k = 32 is the best. Adding more prototypes usually adds duplicates, and makes learning coefficients harder.
- Fast NMS performs NMS in parallel, allowing those otherwise be removed bbox to suppress lower bbox scores as well. This hurts 0.3 mAP.
- The performance gap between YOLACt and mask RCNN is due to bbox prediction but not mask quality (same AP gap between mask and bbox performance.)
- Auxiliary task of predicting semantic segmentation map on top of prototype masks. This auxiliary task is only enabled during training and thus do not have speed penalty.
Notes