August 2021
tl;dr: Simplified tracking mechanism from CenterPoint.
Overall impression
SimTrack stands for simplified tracking. This paper is heavily inspired by CenterPoint. Instead of using the predicted motion for better matching (data association) in tracking, SimTrack simplifies that matching to a simple look-up (or "read-off", as the paper calls it).
The use of the combined map at inference time resembles a simple version of ConvLSTM (memory). This helps the network look beyond the two immediate frames and thus handle occlusion better. The memory is not specifically trained; it is only leveraged at inference time.
The key design is to predict the first-appear location of each object in a given snippet to get the tracking identity and then update the location based on motion estimation.
Difference with CenterPoint:
- CenterPoint treats the forecasted detection as a bridge for object matching instead of using it as the final tracking output.
- CenterPoint only cares about the location relationship of objects between frames, but not their confidence. SimTrack introduces the confidence of objects between frames via pooling features from previous frames.
Key ideas
- SimTrack simplifies the tracking mechanism with improvements
- Replaces bipartite matching with simple read-off. –> This may be improved with nearest neighbor look up
- Track life management: the previous feature map is added to the current frame after ego-motion compensation. If the memory fades, then
- Architecture: can be used as an add-on to any voxel or pillar based lidar object detector (such as PointPillars and VoxelNet).
- Hybrid time centerness map
- Motion updating: predicts the movement of each object between two consecutive frames, which is then used to update the centerness map during inference.
- Regression: same as CenterPoint.
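The inference loop described above (combine the previous map with the current hybrid-time map, read off identities, then apply the motion update) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `simtrack_step`, the averaging used to fuse the two maps, the threshold value, and the omission of peak extraction are all assumptions, and the previous map is assumed to be already ego-motion compensated.

```python
import numpy as np

def simtrack_step(prev_heat, prev_ids, hybrid_heat, motion, thresh=0.3, next_id=0):
    """One simplified tracking step on a BEV grid (hypothetical sketch).

    prev_heat:   (H, W) updated centerness map from the previous step,
                 assumed already warped into the current ego frame.
    prev_ids:    (H, W) int map of track ids, -1 where there is no track.
    hybrid_heat: (H, W) hybrid-time centerness map predicted for this step
                 (peaks at the first-appear location of each object).
    motion:      (H, W, 2) predicted (d_row, d_col) offsets in grid cells.
    """
    H, W = hybrid_heat.shape
    # Assumption: fuse memory and current prediction by averaging.
    combined = 0.5 * (prev_heat + hybrid_heat)
    new_heat = np.zeros_like(combined)
    new_ids = np.full((H, W), -1, dtype=int)
    tracks = []
    # Assumption: every cell above the threshold is an object (no NMS here).
    for r, c in zip(*np.nonzero(combined > thresh)):
        # Read-off: the identity is whatever the previous map stored in this cell.
        track_id = int(prev_ids[r, c])
        if track_id < 0:  # no stored identity, so this is a newborn track
            track_id = next_id
            next_id += 1
        # Motion update: move the object from its first-appear cell to frame t.
        nr = int(round(float(r + motion[r, c, 0])))
        nc = int(round(float(c + motion[r, c, 1])))
        if 0 <= nr < H and 0 <= nc < W:
            new_heat[nr, nc] = combined[r, c]
            new_ids[nr, nc] = track_id
            tracks.append((track_id, nr, nc, float(combined[r, c])))
    return new_heat, new_ids, tracks, next_id
```

Under these assumptions, a track with combined confidence 0.85 that receives no support in the next frame decays to 0.425 (still above a 0.3 threshold) and then to 0.2125 (dropped), which is one way to picture the fading memory that lets a track survive a short occlusion.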
Technical details
- The memory mechanism is not trained end to end, so there is a discrepancy between training and inference. –> this may be a future direction for improvement
- Point input is (x, y, z, r, $\Delta_t$), where $\Delta_t$ is the timestamp relative to the current sweep.
- Ground truth handling in hybrid map (first appear)
  - Continuous track (present in both t-1 and t): its location at t-1 is the positive on the heat map
  - Dead track in t: negative
  - New track in t: its location at t is the positive on the heat map
- Tracking by detection still remains the predominant method in tracking.
- AdamW + one cycle LR, following CenterPoint.
- Evaluation: nuScenes uses a 2 m center-distance threshold in BEV, and Waymo uses a 3D IoU of 0.7, to decide true positives (TP).
- Input pillar size is 0.2 m, same as CenterPoint for an apples-to-apples comparison.
- Additional latency is only 1-2 ms (on Titan RTX though).
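The first-appear ground-truth rule for the hybrid-time map listed above can be illustrated with a small sketch. It assumes object centers are already expressed as grid cells in a common coordinate frame and uses one-hot positives instead of the Gaussian-shaped targets a real heat-map loss would typically use; the function name `hybrid_targets` and the dictionary layout are hypothetical.

```python
import numpy as np

def hybrid_targets(prev_centers, curr_centers, shape):
    """Build a one-hot hybrid-time target map (hypothetical sketch).

    prev_centers / curr_centers: dict mapping track id -> (row, col) grid cell
    at t-1 and t. Returns the target map and a per-track status label.
    """
    heat = np.zeros(shape, dtype=np.float32)
    status = {}
    for tid, (r, c) in curr_centers.items():
        if tid in prev_centers:
            # Continuous track: the positive sits at the first-appear (t-1) cell.
            pr, pc = prev_centers[tid]
            heat[pr, pc] = 1.0
            status[tid] = "continuous"
        else:
            # New track: the positive sits at its cell in frame t.
            heat[r, c] = 1.0
            status[tid] = "new"
    for tid in prev_centers:
        if tid not in curr_centers:
            # Dead track: contributes no positive, so its cell stays negative.
            status[tid] = "dead"
    return heat, status
```

For example, with `prev_centers = {1: (0, 0), 2: (2, 2)}` and `curr_centers = {1: (0, 1), 3: (4, 4)}`, track 1 is marked positive at its old cell `(0, 0)`, track 2 leaves no positive, and track 3 is marked positive at `(4, 4)`.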
Notes
- Code will be available on GitHub.
- Two questions sent to the 1st author:
  - Q: I think the better performance of SimTrack in handling occlusion does not only come from the better track management, but more from the operation of concatenating the previous updated centerness map. This resembles a hand-crafted LSTM at inference time and introduces long-term memory. If we remove the combined map (as discussed in 4.5), would SimTrack still be able to handle occlusions? My guess is that it would be no better than CenterPoint.
  - A: If removing the combined map, I did not check the quantitative results, but I guess the answer is NO. CenterPoint manually keeps the dead objects to handle occlusion, while our approach uses combined confidence. So I think combining maps is necessary.
  - Q: The simple operation of read-off (or look-up) is implicitly based on the assumption that the prediction is accurate enough to make the prediction (detection at t-1 plus motion) and the detection of the next frame (detection at t) fall in the same feature map grid cell. Essentially it also has an implicit threshold: the grid size of the centerness feature map (instead of an explicit one as in CenterPoint, 1 m, 4 m, etc.).
  - A: Yes, the read-off can be thought of as using the grid size as the threshold. A subtle difference is that CenterPoint first sorts the detections according to the confidence score and then greedily matches them in descending order. This may cause some ambiguity, especially for the low-confidence detections due to occlusions. As discussed in the paper, the detection score is not always correlated with the matching confidence. That is why our approach performs better in the high-recall regime even when CenterPoint uses the same detection results as ours.
- SimTrack improves both detection and tracking. –> Honestly I think the detection improvement is the key to the success of this paper.
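To make the second question concrete, here is a rough side-by-side of the two association styles discussed above: confidence-sorted greedy matching with an explicit distance threshold, and read-off where the grid cell acts as the implicit threshold. Both functions are simplified illustrations with hypothetical names and inputs (track positions are assumed to be already expressed in the same frame as the detections, and the 0.8 m cell size in the example is made up); neither is taken from CenterPoint or SimTrack code.

```python
import math

def greedy_match(track_pos, detections, max_dist):
    """Greedy association: detections in descending score order, each taking
    the closest still-unmatched track within max_dist.

    track_pos:  dict track id -> (x, y) in meters.
    detections: list of (x, y, score).
    Returns dict detection index -> track id.
    """
    order = sorted(range(len(detections)), key=lambda i: -detections[i][2])
    free = dict(track_pos)
    matches = {}
    for i in order:
        x, y, _ = detections[i]
        best = min(free, key=lambda t: math.hypot(free[t][0] - x, free[t][1] - y),
                   default=None)
        if best is not None and math.hypot(free[best][0] - x, free[best][1] - y) <= max_dist:
            matches[i] = best
            del free[best]
    return matches

def read_off(track_pos, detections, cell):
    """Read-off association: a detection inherits the id stored in its own
    grid cell, so the cell size is the implicit matching threshold."""
    grid = {}
    for tid, (x, y) in track_pos.items():
        grid[(math.floor(x / cell), math.floor(y / cell))] = tid
    matches = {}
    for i, (x, y, _) in enumerate(detections):
        key = (math.floor(x / cell), math.floor(y / cell))
        if key in grid:
            matches[i] = grid[key]
    return matches

tracks = {1: (10.1, 5.1)}
near_boundary = [(10.5, 5.1, 0.9)]   # 0.4 m away, but across a cell boundary
same_cell = [(10.3, 5.2, 0.9)]       # lands in the same 0.8 m cell as track 1
```

With these toy inputs, `greedy_match(tracks, near_boundary, max_dist=1.0)` links the detection to track 1, whereas `read_off(tracks, near_boundary, cell=0.8)` finds nothing because the detection falls in the neighboring cell; `read_off(tracks, same_cell, cell=0.8)` does recover track 1. This is the sense in which read-off relies on the motion prediction being accurate to within one grid cell.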