October 2019
name | time (YYMM) | venue | title | tl;dr | predecessor | backbone | 3d size | 3d shape | keypoint | 3d orientation | distance | 2D to 3D tight optim | required input | drawbacks | tricks and contributions | insights | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mono3D | 1512 | CVPR 2016 | Mono3D: Monocular 3D Object Detection for Autonomous Driving | The pioneering paper on monocular 3DOD, with many hand-crafted features | Mono3D | Faster RCNN | from 3 templates per class | None | None | scoring of dense proposals | scoring of dense proposals | None | 2D bbox, 2D seg mask, 3D bbox | shared feature maps (Mono3D) | |||
Deep3DBox | 1612 | CVPR 2017 | Deep3DBox: 3D Bounding Box Estimation Using Deep Learning and Geometry | Monocular 3D object detection (3DOD) using the 2D bbox and geometry constraints. | Deep3DBox | MS-CNN | L2 loss for offset from subtype average | None | None | multi-bin for yaw | 2D/3D optimization | the original Deep3DBox optimization | 2D bbox, 3D bbox, intrinsics | locks in the error of the 2D object detection | |||
Deep MANTA | 1703 | CVPR 2017 | Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image | Predict keypoints and use 3D-to-2D projection (EPnP) to get the position and orientation of the 3D bbox. | None | cascaded Faster RCNN | template classification scaled by a scaling factor | template classification scaled by a scaling factor | 36 keypoints | 6DoF pose by 2D/3D matching (EPnP) | 6DoF pose by 2D/3D matching (EPnP) | None | 2D bbox, 3D bbox, 103 3D CAD models with 36 keypoint annotations | semi-auto labeling by putting templates into 3D bboxes | |||
3D-RCNN | 1712 | CVPR 2018 | 3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare | Inverse graphics: predict shape and pose, then render and compare | Deep3DBox | Faster RCNN | subtype average | TSDF encoding, PCA, 10-dim space | 2D projection of 3D center | viewpoint (azimuth, elevation, tilt) with improved weighted-average multi-bin | find distance by moving along the viewing ray until the 3D bbox fits tightly into the 2D bbox | yes, move the 3D bbox along the ray until it fits tightly into the 2D bbox | 2D bbox, 3D bbox, 3D CAD | ||||
MLF | 1712 | CVPR 2018 | MLF: Multi-Level Fusion based 3D Object Detection from Monocular Images | Estimate a depth map from monocular RGB and concatenate into RGBD for mono 3DOD. | Deep3DBox | Faster RCNN | offset from whole-dataset average | None | None | multi-bin, and SL1 for cos and sin | MonoDepth, SL1 for depth regression | None | 2D bbox, 3D bbox, pretrained depth model | pretrained depth model | point cloud as 3-ch xyz map | ||
MonoGRNet | 1811 | AAAI 2019 | MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization | Use the same network to estimate instance depth, 2D and 3D bbox. | MonoGRNet | MultiNet (YOLO+RoIAlign) | regress 8 corners in allocentric coordinate system | None | 2D projection of 3D center | regress 8 corners in allocentric coordinate system | instance depth estimation (IDE) according to a grid | 2D bbox, 3D bbox, intrinsics, depth map | requires depth map for training | 2D/3D center loss, local/global corner loss; stagewise training to start 3D after 2D | instance depth estimation: pixel level depth estimation does not focus on object localization by design; depth of the nearest object instance | ||
OFT | 1811 | BMVC 2019 | OFT: Orthographic Feature Transform for Monocular 3D Object Detection | Learn a projection of camera image to BEV for 3D object detection. | OFT | ResNet18+ResNet16 top down network | L1 loss for offset from subtype average in log space | None | None | L1 on cos and sin | positional offset in BEV space from local peaks | None | 2D bbox, 3D bbox (intrinsics learned) | TopDown network to reason in BEV | |||
ROI-10D | 1812 | CVPR 2019 | ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape | Concat depth map and coord map to RGB features + 2DOD + car shape reconstruction (6-dim latent space) for mono 3DOD. | | Faster RCNN with FPN | offset from whole-dataset average | TSDF encoding, 3D Autoencoder, 6-dim space | None | 4-d quaternion | regress depth z | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | 8-corner loss; stagewise training to start 3D after 2D | ||||
Pseudo-Lidar | 1812 | CVPR 2019 | Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving | Estimate a depth map from the RGB image (mono/stereo) and use it to lift RGB to a point cloud | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | DORN depth estimation | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | pretrained depth model | data representation matters | ||
Mono3D++ | 1901 | AAAI 2019 | Mono3D++: Monocular 3D Vehicle Detection with Two-Scale 3D Hypotheses and Task Priors | Mono 3DOD based on 3D/2D consistency, in particular landmarks and shape reconstruction. | DeepMANTA | SSD for 2D bbox, stacked hourglass for keypoints, MonoDepth for depth | N basis shapes (N=?) | 14 landmarks | CE cls on 360 bins | MonoDepth | L1 loss | 2D bbox, 3D bbox, pretrained depth model, 3D CAD model with keypoints | cars should stay on the ground, look like cars, and be at a reasonable distance; enforces 2D/3D consistency of the generated 3D vehicle hypotheses | ||||
GS3D | 1903 | CVPR 2019 | GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving | Get a 3D bbox proposal (guidance) from the 2D bbox + prior knowledge, then refine the 3D bbox using surface features | | FasterRCNN with VGG16 (2D+O) | subtype average | None | None | from RoIAligned features (possibly multi-bin) | approximated with bbox height * 0.93 | None | 2D bbox, 3D bbox, intrinsics | quality-aware loss, surface feature extraction | ||||
Pseudo-Lidar Color | 1903 | ICCV 2019 | Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving | Concurrent work with Pseudo-lidar, but with color embedding | Pseudo-lidar | Frustum-PointNet | 3DOD on point cloud | None | None | 3DOD on point cloud | various pretrained depth weights | None | 2D bbox, 3D bbox, intrinsics, pretrained depth model | ||||
BirdGAN | 1904 | IROS 2019 | BirdGAN: Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles | Learn to map 2D perspective image to BEV with GAN | BirdGAN | DCGAN | oriented 2DOD on BEV point cloud | None | None | oriented 2DOD on BEV point cloud | oriented 2DOD on BEV point cloud | None | 2D bbox, 3D bbox (intrinsics learned) | In the clipping case, the frontal detectable depth is only about 10 to 15 meters | |||
FQNet | 1904 | CVPR 2019 | FQNet: Deep Fitting Degree Scoring Network for Monocular 3D Object Detection | Train a network to score the 3D IoU of a projected 3D wireframe with the GT. | Deep3DBox | MS-CNN | k-means clustering and multi-bin | None | None | k-means clustering and multi-bin | approximated via optimization | similar to Deep3DBox (details in appendix) | 2D bbox, 3D bbox, intrinsics | ||||
MonoPSR | 1904 | CVPR 2019 | MonoPSR: Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction | 3DOD by generating 3D proposals first and then reconstructing the local point cloud of the dynamic object | Deep3DBox, Pseudo-lidar | MS-CNN | L2 loss for offset from subtype average | None | None | multi-bin for yaw | approximated with bbox height, then regress the residual from RoIAligned features | None | 2D bbox, 3D bbox, intrinsics | shared feature maps (Mono3D) | |||
MonoDIS | 1905 | ICCV 2019 | MonoDIS: Disentangling Monocular 3D Object Detection | End-to-end training of 2D and 3D heads on top of RetinaNet for monocular 3D object detection | MonoGRNet | RetinaNet+2D/3D head | offset from whole-dataset average, learned via 3D corner loss | None | 2D projection of 3D center | learned via 3D corner loss | regressed from dataset average, learned via 3D corner loss | None | 2D bbox, 3D bbox, intrinsics | signed IoU loss (pulls bboxes together even before they intersect), disentangled learning | the disentangling transformation splits a combined loss (e.g., over bbox size and location at the same time) into groups; each loss term takes one group of parameters from the prediction and the rest from the GT | ||
monogrnet_russian | 1905 | | MonoGRNet 2: Monocular 3D Object Detection via Geometric Reasoning on Keypoints | Regress keypoints in the 2D image and use 3D CAD models to infer depth | DeepMANTA | Mask RCNN with FPN | SL1 loss for offset from subtype average in log space | 5 CAD models | 14 landmarks | multi-bin for yaw in 72 non-overlapping bins | approximated with windshield height | None | 2D bbox, 3D bbox, intrinsics | semi-auto labeling by putting templates into 3D bboxes | ||||
Pseudo-Lidar end2end | 1905 | ICCV 2019 | Pseudo lidar-e2e: Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud | End-to-end pseudo-lidar training with a 2D/3D bbox consistency loss | Pseudo-Lidar | Frustum-PointNet | 3DOD on point cloud | None | None | 3DOD on point cloud | DORN depth estimation | bbox consistency loss | 2D bbox, 2D seg mask, 3D bbox, intrinsics | pretrained depth model | 2D/3D bbox consistency | ||
Shift RCNN | 1905 | IEEE ICIP 2019 | Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints | Extends Deep3DBox by regressing residual center positions. | Deep3DBox | Faster RCNN | L2 loss for offset from subtype average | None | None | cos and sin, with unit-norm constraint | approximated via optimization | slightly different from Deep3DBox | 2D bbox, 3D bbox, intrinsics | ||||
BEV IPM OD | 1906 | IV 2019 | BEV-IPM: Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image | IPM of the pitch/roll-corrected camera image, then 2DOD on the IPM image | | YOLOv3 | oriented 2DOD on BEV image | None | None | oriented 2DOD on BEV image | oriented 2DOD on BEV image | None | 2D bbox, BEV oriented bbox, IMU correction | up to 40 meters | motion cancellation using IMU | IPM assumptions: 1) the road is flat; 2) the mounting position of the camera is stationary (motion cancellation helps here); 3) the vehicle to be detected is on the ground | |
Pseudo-Lidar++ | 1906 | | Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving | Improve the depth estimation of pseudo-lidar with a stereo depth network (SDN) and sparse depth measurements on landmark pixels from few-line lidars. | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | PSMNet-finetuned stereo depth | None | 2D bbox, 3D bbox, pretrained depth model, sparse lidar data | use sparse lidar to correct depth, stereo depth loss | |||
SS3D | 1906 | | SS3D: Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss | CenterNet-like structure that directly regresses 26 attributes per object to fit a 3D bbox | | U-Net like arch | log size | None | 8 3D corners projected to 2D | cos and sin (multi-bin not suitable) | directly regress | None | 2D bbox, 3D bbox, intrinsics | models uncertainty, directly regresses 26 numbers, 20 fps inference | ||||
TLNet | 1906 | CVPR 2019 | TLNet: Triangulation Learning Network: from Monocular to Stereo 3D Object Detection | Place 3D anchors inside the frustum subtended by the 2D detection as the mono baseline | | Faster RCNN with two refine stages | refined from dataset average | None | None | refined from 0 and 90 degree anchors | refined from 3D anchors | None | 2D bbox, 3D bbox, intrinsics | stereo coherence score and channel reweighting | |||
M3D-RPN | 1907 | ICCV 2019 | M3D-RPN: Monocular 3D Region Proposal Network for Object Detection | Regress 2D and 3D bbox parameters simultaneously by precomputing 3D mean stats for each 2D anchor. | Faster RCNN | log size times 3D anchor size | None | None | smooth L1 directly on angle, postprocess to refine | None | 2D bbox, 3D bbox, intrinsics | angle postprocessing | 2D anchor with 2D/3D properties, depth aware conv, neg log IoU loss for 2D detection, directly regress 12 numbers | Reliance on additional sub-networks introduces persistent noise | |||
ForeSeE | 1909 | | ForeSeE: Task-Aware Monocular Depth Estimation for 3D Object Detection | Train a depth estimator focused on foreground moving objects and improve pseudo-lidar-based 3DOD. | Pseudo-lidar | Frustum-PointNet / AVOD | 3DOD on point cloud | None | None | 3DOD on point cloud | learn foreground/background depth individually | 2D bbox, 3D bbox, depth map | depth combination: take the element-wise maximum of the confidence vectors over C depth bins, then pass through a softmax | not all pixels are equal: the same depth estimation error matters far more on a car than on a building | |||
CenterNet | 1904 | | Objects as Points | Object detection as detection of the center point of the object and regression of its associated properties. | CenterNet | DLA (Unet) | L1 loss over absolute dimension | None | None | multi-bin for global yaw in two overlapping bins | L1 loss on 1 over regressed disparity | None | 2D bbox, 3D bbox, intrinsics | highly flexible network | |||
Mono3D Track | 1811 | ICCV 2019 | Joint Monocular 3D Vehicle Detection and Tracking | Add 3D tracking with LSTM based on mono3d object detection. | Deep3DBox | Faster RCNN | L1 loss for offset from subtype average | None | 2D projection of 3D center | multi-bin for local yaw in two bins | L1 loss on 1 over regressed disparity | None | 2D bbox, 3D bbox, intrinsics | regressing 2D projection of 3D center helps recover amodal 3D bbox | |||
CasGeo | 1909 | | 3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results | Extends Deep3DBox by regressing the 3D bbox center on the bottom edge and viewpoint classification | Deep3DBox | MS-CNN | refined from subtype average | None | 2D projection of bottom surface center | multi-bin for yaw, viewpoint estimation | approximated via optimization (Gauss-Newton) | similar to Deep3DBox | 2D bbox, 3D bbox, intrinsics | regress the 3D height projection to help with the initial guess of distance | |||
GPP | 1811 | ArXiv | GPP: Ground Plane Polling for 6DoF Pose Estimation of Objects on the Road | Regress tireline and height and project to the best ground plane near the car | GPP | RetinaNet+2D/3D head | refined from subtype average | None | 2D projection of tirelines (observer facing vertices) | coarse (8) viewpoint classification | IPM based on best fitting ground plane | None | 2D bbox, 3D bbox, intrinsics, fitted road planes | Need to collect and fit road data | able to predict local road pose | NA | |
MVRA | 1910 | ICCV 2019 | MVRA: Multi-View Reprojection Architecture for Orientation Estimation | Build the 2D/3D constraints optimization into neural network and use iterative method to refine cropped cases. | Deep3DBox | Faster RCNN | refined from subtype average | None | None | multi-bin for yaw, viewpoint estimation, iterative trial and error for truncated | approximated via optimization | similar to Deep3DBox (details in appendix) | 2D bbox, 3D bbox, intrinsics | predict better for truncated bbox | NA |
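Several rows above (Deep3DBox, MonoPSR, FQNet, CasGeo, CenterNet) use multi-bin orientation regression: classify yaw into coarse bins, then regress a sin/cos residual within the winning bin. A minimal decoding sketch, assuming bin centers evenly spaced over [-pi, pi) (the exact bin layout and overlap vary per paper):

```python
import math

def decode_multibin(bin_conf, bin_residuals):
    """Decode a Deep3DBox-style multi-bin orientation estimate.

    bin_conf: per-bin confidence scores.
    bin_residuals: per-bin (cos, sin) residuals relative to the bin center.
    Assumes bin centers evenly spaced over [-pi, pi).
    """
    n = len(bin_conf)
    best = max(range(n), key=lambda i: bin_conf[i])   # highest-confidence bin
    center = -math.pi + (best + 0.5) * 2.0 * math.pi / n
    cos_r, sin_r = bin_residuals[best]
    angle = center + math.atan2(sin_r, cos_r)         # add residual to center
    return (angle + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-pi, pi)
```

The classification step avoids the discontinuity of regressing a raw angle, while the residual keeps sub-bin precision.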
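GS3D and MonoPSR initialize distance from the 2D bbox height via the pinhole relation z ≈ f·H / h2d; GS3D's 0.93 factor reflects its empirical stat that the projected 3D box height is about 0.93 of the 2D bbox height. A hedged sketch (function name and default are illustrative):

```python
def depth_from_height(f_pixels, h3d_meters, h2d_pixels, proj_ratio=0.93):
    """Pinhole depth initializer: an object of physical height H at depth z
    projects to roughly f * H / z pixels, so z ~= f * H / h_projected.
    proj_ratio converts the 2D bbox height into the projected 3D box height
    (~0.93 per GS3D's statistic; assumed here, not a universal constant)."""
    return f_pixels * h3d_meters / (proj_ratio * h2d_pixels)
```

MonoPSR uses such an estimate only as a proposal and regresses a residual from RoIAligned features on top of it.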
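The pseudo-lidar family (Pseudo-Lidar, Pseudo-Lidar Color, Pseudo-Lidar++, ForeSeE) shares one core step: back-projecting an estimated depth map into a camera-frame point cloud, on which an off-the-shelf lidar detector then runs. A minimal NumPy sketch of that lifting:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Lift an HxW depth map to an (H*W, 3) camera-frame point cloud using
    the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth.
    This is the representation change at the heart of pseudo-lidar."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

As the "data representation matters" insight notes, the detector downstream is unchanged; only the input representation moves from image grid to 3D points.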
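Rows that regress in an allocentric frame (MonoGRNet's corners, Mono3D Track's local yaw) versus an egocentric one (CenterNet's global yaw) differ by the ray angle to the object. A small conversion sketch, assuming camera axes with x right and z forward:

```python
import math

def local_to_global_yaw(theta_local, x, z):
    """Convert an observation-angle (allocentric/local) yaw, which is what a
    cropped object's appearance actually constrains, into an egocentric/global
    yaw by adding the ray angle to the object center:
    theta_global = theta_local + atan2(x, z)."""
    theta = theta_local + math.atan2(x, z)
    return (theta + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-pi, pi)
```

This is why local-yaw methods pair the angle head with a regressed 2D projection of the 3D center: the ray angle is needed to recover the global pose.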
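BEV-IPM's flat-road assumption makes inverse perspective mapping a closed-form ray-plane intersection. A sketch under those stated assumptions (flat ground, known camera height, no residual pitch/roll after IMU correction; camera axes x right, y down, z forward):

```python
def ipm_ground_point(u, v, fx, fy, cx, cy, cam_height):
    """Flat-road IPM: intersect the viewing ray of pixel (u, v) with the
    ground plane y = cam_height. Returns the (x, z) ground position in
    meters, or None for pixels at or above the horizon."""
    dy = (v - cy) / fy
    if dy <= 0:                 # ray never hits the ground ahead of the camera
        return None
    t = cam_height / dy         # scale along the ray to reach the ground
    x = t * (u - cx) / fx
    return (x, t)               # z equals the ray scale t
```

The table's listed assumptions map directly onto this code: violating flatness or a moving camera pose changes the plane, which is why IMU-based motion cancellation helps.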
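MonoDIS's disentangling insight (last column of its row) can be made concrete: rather than one corner loss over all 3D parameters at once, compute one corner loss per parameter group, substituting ground truth for every other group. A toy sketch with hypothetical `size`/`center` groups only (MonoDIS uses more groups, includes rotation, and a full lifting transform):

```python
import numpy as np

def corners(size, center):
    """Axis-aligned 8 corners from (l, w, h) size and (x, y, z) center
    (yaw omitted to keep the sketch short)."""
    l, w, h = size
    offsets = np.array([[sx * l / 2, sy * w / 2, sz * h / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return offsets + np.asarray(center)

def disentangled_corner_loss(pred, gt):
    """One L1 corner loss per parameter group; all other groups are replaced
    by their GT values, so each term is driven by exactly one group."""
    losses = {}
    for group in ('size', 'center'):
        mixed = dict(gt)            # start from ground truth everywhere
        mixed[group] = pred[group]  # swap in the prediction for this group
        losses[group] = np.abs(corners(mixed['size'], mixed['center'])
                               - corners(gt['size'], gt['center'])).mean()
    return losses
```

Each gradient then flows through a single parameter group, avoiding the cross-talk of a monolithic loss where, e.g., a size error can be compensated by a location error.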