November 2019
tl;dr: Regress tireline and height and project to the best ground plane near the car.
Overall impression
GPP generates artificial 2D landmarks with 3D bbox annotation. It purposefully predicts more attributes than needed to estimate 3D bbox (over-determined) and uses these predictions to from maximum consensus set of attributes, in a manner similar to RANSAC, making it more robust to outliers.
Key ideas
- Viewpoint (orientation) classes: 4 x 2. Depending on whether the central edge is on the left half or right half of 2D bbox (if local yaw is beyond 20 deg in a typical car).
- Ground Plane Polling:
- Given a plane candidate, get the projection of three tirelines. Form a virtual vertical backplane edge
- Find the nearest point on the backprojected ray of backplane edge top point to the virtual edge (in practice they do not intersect).
- 4 3D points form 6 edges pairs. The residual error of the 6 edges and real 3D length
- The best fit plane minimizes the residual loss
- Directly enforcing orthogonality led to most probable plane being discarded
- Discard the tireline corresponding to width of the car (only using side tireline) to enforce orthoganality
- Reconstruct the 3D bbox in a layer
Technical details
- RetinaNet backbone, classify into 8*K classes, 8 being the orientation class.
- Using RANSAC to create 22k ground plane candidates based on KITTI. This is with tight constraint (t = 2 cm) and very high probability of success (p = 0.999). In experiment, 10K planes are used.
- The plane is denoted by 4 numbers (as it is 4 DoF).
- Deep3Dbox cannot handle closeby objects well as the error goes up with very close distance.
Notes
- Maybe we can enforce all object within the image are on the ground. to make better prediction.
- The 2D/3D tight constraint looks invalid based on Fig. 5. Maybe not for closeby cars.