November 2021
tl;dr: BEV object detection with DETR structure.
Overall impression
Inspired by DETR, the paper uses sparse object queries in BEV space for BEV object detection and manipulates predictions directly in 3D space. It does not rely on dense depth prediction and thus avoids the associated reconstruction errors. It is in a way similar to STSU.
Mono3D methods, in contrast, have to rely on per-image and global NMS to remove redundant bboxes in each view and in the overlap regions.
The work is further improved with 3D positional embedding by PETR and PETRv2.
The extension of DETR3D to the temporal domain is relatively straightforward: take the 3D reference points, transform them to past timestamps using ego motion, and then project them to the images from those past timestamps.
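A minimal sketch of that temporal step, assuming a 4x4 ego-motion matrix `T_past_from_now` and per-camera `lidar2img` projection matrices for the past frame (all names here are hypothetical, not from the released code):

```python
import torch

def project_refs_to_past(ref_points, T_past_from_now, lidar2img_past):
    """ref_points: (N, 3) 3D reference points in the current ego frame.
    T_past_from_now: (4, 4) ego-motion transform, current -> past frame.
    lidar2img_past: (C, 4, 4) projection matrices of the C past-frame cameras.
    Returns (C, N, 2) pixel coordinates in the past images."""
    n = ref_points.shape[0]
    homo = torch.cat([ref_points, ref_points.new_ones(n, 1)], dim=-1)  # (N, 4)
    past_pts = homo @ T_past_from_now.T               # warp into the past ego frame
    cam_pts = torch.einsum('cij,nj->cni', lidar2img_past, past_pts)    # (C, N, 4)
    return cam_pts[..., :2] / cam_pts[..., 2:3].clamp(min=1e-5)        # perspective divide
```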
Key ideas
- Iterative refinement of object queries, similar to the cascaded refinement in Sparse RCNN and Cascade RCNN.
- Predicts the bbox centers (decoded from the queries), projects the centers back to the images with the camera transformation matrices, samples feature points with bilinear interpolation, and integrates them back into the queries (see the sketch after this list).
- It also works with only a single pass, but with iterative refinement the results get better. L=6 in this paper.
- Performance is much better in overlap regions, where objects are more likely to be cut off.
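A minimal sketch of that refinement loop, assuming a hypothetical `center_head` that decodes 3D centers from the queries and per-camera `lidar2img` projection matrices; the real decoder layers also include self-/cross-attention, multi-scale features, and masking of points that project outside an image:

```python
import torch
import torch.nn.functional as F

def refine_queries(queries, center_head, feat_maps, lidar2img, num_layers=6):
    """queries: (N, D) object queries; feat_maps: (C, D, H, W) per-camera
    feature maps; lidar2img: (C, 4, 4) projection matrices."""
    H, W = feat_maps.shape[-2:]
    for _ in range(num_layers):
        centers = center_head(queries)                       # (N, 3) bbox centers
        homo = torch.cat([centers, centers.new_ones(len(centers), 1)], dim=-1)
        pts = torch.einsum('cij,nj->cni', lidar2img, homo)   # (C, N, 4)
        uv = pts[..., :2] / pts[..., 2:3].clamp(min=1e-5)    # pixel coordinates
        # Normalize to [-1, 1] and sample features bilinearly.
        grid = torch.stack([uv[..., 0] / W * 2 - 1,
                            uv[..., 1] / H * 2 - 1], dim=-1) # (C, N, 2)
        sampled = F.grid_sample(feat_maps, grid.unsqueeze(2),
                                align_corners=False)         # (C, D, N, 1)
        # Average over cameras and fold the image features back into the queries.
        queries = queries + sampled.squeeze(-1).mean(dim=0).T
    return queries
```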
Technical details
- Initialization seems to matter quite a lot for transformer-based networks. If the network is pretrained from FCOS3D, the performance is boosted by 5 absolute points (see Table 1).
- Ground-truth depth supervision yields a more realistic pseudo-lidar point cloud than self-supervised depth. That is perhaps the reason why DORN rather than Monodepth2 pretrained weights are preferred in pseudo-lidar papers (the back-projection step is sketched after this list).
- DETR3D also predicts velocity, thus a 7+2=9 DoF bbox (decoding sketched after this list). --> But why? The predicted velocity must be unreliable due to the lack of a memory module.
- AdamW is the optimizer.
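On the pseudo-lidar point above: a minimal sketch of lifting a predicted depth map into a point cloud, assuming metric depth and known intrinsics K (function and variable names are mine):

```python
import torch

def depth_to_pseudolidar(depth, K):
    """depth: (H, W) per-pixel metric depth; K: (3, 3) camera intrinsics.
    Returns (H*W, 3) points in the camera frame."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing='ij')
    uv1 = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
    rays = uv1 @ torch.inverse(K).T       # unit-depth ray through each pixel
    return rays * depth.reshape(-1, 1)    # scale each ray by its depth
```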
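And for the 9-DoF bbox: a toy decoding of the 7+2 parameterization; the ordering and the 256-d query size are my assumptions, not taken from the paper:

```python
import torch

query = torch.randn(256)               # one decoded object query (dim assumed)
reg_head = torch.nn.Linear(256, 9)     # hypothetical regression head
box = reg_head(query)
center, size = box[:3], box[3:6]       # (x, y, z) center and (w, l, h) size
yaw, velocity = box[6], box[7:9]       # heading and BEV velocity (vx, vy)
```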
Notes
- This paper uses sparse transformation instead of transforming the entire feature map to BEV. This is what Andrej Karpathy mentioned as the next step for FSD in his Tesla AI Day talk in 2021.