July 2020
tl;dr: Use PointNet to extract embedding vectors from instance segmentation masks for tracking.
## Overall impression
This work tackles the newly created MOTS (multiple object tracking and segmentation) track introduced at CVPR 2019. It follows a tracking-by-segmentation paradigm: it builds on existing instance segmentation work, and the main contribution lies in the association/tracking part.
PointTrack uses a single-stage instance segmentation method with a seeding location. This makes it compatible with many instance segmentation methods, such as CenterMask or BlendMask.
DeepSORT extracts features from image patches, while PointTrack extracts features from 2D point clouds.
## Key ideas
- Instance segmentation with SpatialEmbedding
- Tracking Architecture
- Regard 2D image pixels as unordered 2D point clouds and learn instance embeddings.
- Get a context-aware patch: dilate the bbox by 20% to include more of the environment.
- Uniformly sample 1000 points on the object and 500 points on the background.
- Use color, position offset from the instance center, and class category to encode each 2D point.
- The position offset is lifted from 4 dimensions to 64 dimensions with a sinusoidal positional encoding, following the Transformer, to make it easier to learn.
- Association of instances across frames is done with embedding distance and segmentation mask IoU.
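The sampling and encoding steps above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: it uses a 2D center offset (the paper lifts a 4-dim position vector), uses a foreground flag in place of the class category, and omits the analogous background sampling on the 20%-dilated box. All function names are hypothetical.

```python
import numpy as np

def sinusoidal_lift(offsets, dim=64):
    """Lift low-dim position offsets to `dim` dims with sine/cosine
    frequencies, in the spirit of the Transformer positional encoding.
    offsets: (N, 2) array of (dx, dy) offsets from the instance center."""
    n, d_in = offsets.shape
    freqs = dim // (2 * d_in)  # frequencies per input coordinate
    out = []
    for k in range(freqs):
        out.append(np.sin(offsets * (2.0 ** k)))
        out.append(np.cos(offsets * (2.0 ** k)))
    return np.concatenate(out, axis=1)  # (N, dim)

def encode_points(image, mask, n_fg=1000, rng=None):
    """Uniformly sample foreground points on the instance mask and build
    per-point features: RGB color, lifted center offset, fg/bg flag."""
    rng = rng or np.random.default_rng(0)
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()              # instance center
    idx = rng.integers(0, len(ys), n_fg)       # uniform sampling (with replacement)
    fy, fx = ys[idx], xs[idx]
    offsets = np.stack([fx - cx, fy - cy], axis=1)
    color = image[fy, fx] / 255.0              # (n_fg, 3) RGB in [0, 1]
    pos = sinusoidal_lift(offsets)             # (n_fg, 64)
    flag = np.ones((n_fg, 1))                  # 1 = foreground point
    return np.concatenate([color, pos, flag], axis=1)
```

The per-point feature matrix would then be fed to a PointNet-style MLP with max pooling to produce the instance embedding.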
- Seed consistency:
- Using optical flow and the last seed to encourage consistent seeds. –> This additional optical flow map would likely help CenterTrack as well.
- Penalize the difference between the seed warped from the last frame with optical flow and the seed predicted in the current frame.
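The seed-consistency penalty above can be sketched as below. This is a hedged sketch assuming a simple squared-error penalty on seed coordinates; the paper's exact loss form may differ.

```python
import numpy as np

def seed_consistency_loss(prev_seed, flow, cur_seed):
    """Penalize disagreement between the previous frame's seed warped by
    optical flow and the seed predicted on the current frame.
    prev_seed, cur_seed: (2,) pixel coordinates (x, y)
    flow: (H, W, 2) forward optical flow from the previous frame."""
    x, y = int(prev_seed[0]), int(prev_seed[1])
    warped = prev_seed + flow[y, x]    # warp the old seed into the current frame
    return float(np.sum((warped - cur_seed) ** 2))
```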
- Training instance embedding:
- Each training batch consists of D track ids, each with three crops equally spaced in time. It deliberately avoids 3 consecutive frames, in order to increase the intra-track-id discrepancy. The spacing S is randomly chosen between 1 and 10.
- PointTrack++ finds that the environment embedding does not converge with S > 2, but for the foreground 2D point cloud a large S (~12) helps achieve higher performance. Thus the two embeddings are trained separately; the individual MLP weights are then fixed, and a new MLP is trained to aggregate the information.
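The triplet sampling scheme above can be sketched as follows, assuming each track is stored as an ordered list of frame ids; `sample_triplet` is a hypothetical helper, and the clamping for short tracks is my own addition.

```python
import random

def sample_triplet(frame_ids, rng=None):
    """Pick three equally spaced frames of one track id with a random
    spacing S in [1, 10], rather than three consecutive frames, to
    increase intra-track appearance variation."""
    rng = rng or random.Random(0)
    s = rng.randint(1, 10)
    s = min(s, (len(frame_ids) - 1) // 2)   # clamp spacing for short tracks
    start = rng.randrange(len(frame_ids) - 2 * s)
    return [frame_ids[start], frame_ids[start + s], frame_ids[start + 2 * s]]
```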
## Technical details
- Points with the highest importance (top 10%) can be visualized via their weights, a natural byproduct of the PointNet embedding.
- Visualization of the instance embeddings with t-SNE is also quite interesting.
- Ablation study showed that the removal of color in the input leads to the biggest drop.
## Notes
- Questions and notes on how to improve/revise the current work