August 2021
tl;dr: Patch-based refinement module to predict vehicle orientation based on wireframes.
Overall impression
EgoNet extracts heatmaps of object parts from local object appearances, which are mapped to the screen coordinates and then further lifted to 3D object pose.
The wireframe (a sparse 3D point cloud representing an interpolated cuboid) idea is similar to that of RTM3D, KM3D-Net, and FQNet.
Key ideas
- Architecture (see the sketch after this sub-list)
- FCN to extract 2D heatmaps of k keypoints. This is supervised by 2D heatmap loss.
- FCN to convert heatmaps to k patch coordinates.
- Affine transformation (scaling and translation) from patch coordinates to image coordinates.
- Lifter MLP to lift the k screen coordinates to 3D points in the camera coordinate system. As translation is irrelevant to pose estimation, the 3D coordinates are normalized relative to the centroid, leaving k-1 independent points.
- Postprocessing: converting the 3D points to 3D orientation. –> This step is NOT detailed in the paper.
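A minimal PyTorch sketch of these stages, under assumed shapes and hypothetical module names (`HeatmapToCoords`, `LifterMLP`, and the soft-argmax stand-in for the coordinate-regression FCN are my assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

K = 33  # number of wireframe keypoints; the exact count is an assumption here

class HeatmapToCoords(nn.Module):
    """Soft-argmax stand-in for the paper's coordinate-regression FCN."""
    def forward(self, heatmaps):                    # (B, K, H, W)
        B, k, H, W = heatmaps.shape
        probs = heatmaps.flatten(2).softmax(-1).view(B, k, H, W)
        xs = torch.linspace(0.0, 1.0, W, device=heatmaps.device)
        ys = torch.linspace(0.0, 1.0, H, device=heatmaps.device)
        x = (probs.sum(dim=2) * xs).sum(-1)         # expected x per keypoint, (B, K)
        y = (probs.sum(dim=3) * ys).sum(-1)         # expected y per keypoint, (B, K)
        return torch.stack([x, y], dim=-1)          # normalized patch coords, (B, K, 2)

def patch_to_image(coords, boxes):
    """Affine map (scale + translate) from normalized patch coords to image coords.
    boxes: (B, 4) detected 2D bboxes as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes.unbind(-1)
    scale = torch.stack([x2 - x1, y2 - y1], dim=-1).unsqueeze(1)   # (B, 1, 2)
    offset = torch.stack([x1, y1], dim=-1).unsqueeze(1)            # (B, 1, 2)
    return coords * scale + offset

class LifterMLP(nn.Module):
    """Lifts the k 2D screen coordinates to centroid-normalized 3D points."""
    def __init__(self, k=K, hidden=1024):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(
            nn.Linear(2 * k, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * k),
        )
    def forward(self, coords_2d):                   # (B, K, 2)
        pts = self.net(coords_2d.flatten(1)).view(-1, self.k, 3)
        return pts - pts.mean(dim=1, keepdim=True)  # drop translation via centroid
```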
- Cross-ratio loss function: invariant under projection, so it can be applied to unannotated images (see the sketch after this sub-list).
- cross-ratio = ((v3 - v1)(v4 - v2)) / ((v3 - v2)(v4 - v1)); for four equally spaced collinear points (e.g., interpolated along a cuboid edge) this equals 4/3.
- Used unannotated ApolloCar3D patches (2D bboxes are still needed) to help with model generalization.
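Since four equally spaced collinear points have cross-ratio 4/3 and the cross-ratio is preserved by perspective projection, the predicted 2D keypoints along an interpolated edge can be supervised without any 3D labels. A minimal sketch; the squared-error penalty and the projection onto the edge direction are my assumptions, not necessarily the paper's exact formulation:

```python
import torch

def cross_ratio(p1, p2, p3, p4, eps=1e-8):
    """Cross-ratio of four (near-)collinear 2D points, each of shape (B, 2).
    Points are reduced to scalars by projecting onto the p1 -> p4 direction."""
    d = p4 - p1
    d = d / (d.norm(dim=-1, keepdim=True) + eps)
    v1, v2, v3, v4 = [((p - p1) * d).sum(-1) for p in (p1, p2, p3, p4)]
    return ((v3 - v1) * (v4 - v2)) / ((v3 - v2) * (v4 - v1) + eps)

def cross_ratio_loss(pts):
    """pts: (B, 4, 2) predicted 2D keypoints sampled along one cuboid edge.
    Equally spaced collinear points keep cross-ratio 4/3 in any perspective
    projection, so no 3D annotation is required."""
    cr = cross_ratio(pts[:, 0], pts[:, 1], pts[:, 2], pts[:, 3])
    return ((cr - 4.0 / 3.0) ** 2).mean()
```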
- Loss
- Heatmap loss (see the sketch after this list)
- 2D loss (in image plane)
- 3D loss (in camera coordinate system)
- Cross-ratio (CR) loss
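For the heatmap loss, a common choice (assumed here; the paper's exact variant may differ) is MSE against Gaussian targets rendered at the ground-truth keypoint locations:

```python
import torch
import torch.nn.functional as F

def gaussian_targets(kpts, H=64, W=64, sigma=2.0):
    """Render one Gaussian target heatmap per keypoint.
    kpts: (B, K, 2) ground-truth (x, y) in heatmap pixels."""
    ys = torch.arange(H, dtype=torch.float32, device=kpts.device).view(1, 1, H, 1)
    xs = torch.arange(W, dtype=torch.float32, device=kpts.device).view(1, 1, 1, W)
    x0 = kpts[..., 0].unsqueeze(-1).unsqueeze(-1)   # (B, K, 1, 1)
    y0 = kpts[..., 1].unsqueeze(-1).unsqueeze(-1)
    return torch.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))

def heatmap_loss(pred, kpts):
    """MSE between predicted heatmaps (B, K, H, W) and Gaussian targets."""
    return F.mse_loss(pred, gaussian_targets(kpts, *pred.shape[2:]))
```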
Technical details
- EgoNet can be used as a module to improve existing single-stage 3D object detectors. –> It can be applied to both camera and lidar inputs.
- The estimation of Euler angles from the 3D points is implemented via SVD (code).
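The standard SVD-based way to do this is the Kabsch / orthogonal Procrustes algorithm; the sketch below assumes a canonical wireframe template and is my illustration, not the repo's code:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_from_points(pred, template):
    """Kabsch / orthogonal Procrustes: best-fit R with pred[i] ≈ R @ template[i].
    pred, template: (K, 3) centroid-normalized 3D points."""
    H = template.T @ pred                 # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])            # reflection guard keeps det(R) = +1
    return Vt.T @ D @ U.T

def euler_from_points(pred, template):
    R = rotation_from_points(pred, template)
    # The Euler convention depends on the camera frame; "YXZ" is illustrative.
    return Rotation.from_matrix(R).as_euler("YXZ")
```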
- Even with only 9 points, it is possible to do self-supervised learning.
- With perfect detections, the AOS (average orientation similarity) of EgoNet can go up to 99%, significantly higher than that of prior methods.
Notes