October 2019
tl;dr: Get 3D bbox proposal (guidance) from 2D bbox + prior knowledge, then refine 3D bbox.
This paper also regresses the 2D bbox and orientation with a conventional 2D object detection (2DOD) architecture, then derives a coarse 3D position and refines it. The approach to generating the initial 3D location is similar to FQNet and MonoPSR.
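To make the "2D bbox + prior knowledge" step concrete, here is a minimal sketch of one common way to get a coarse 3D location: pinhole geometry plus a class-average object height. This is an illustration under stated assumptions, not the paper's exact formulation. The function name `coarse_3d_center`, the `prior_height` parameter, and the simplification that the 2D box height equals the projected object height are all assumptions of this sketch.

```python
import numpy as np

def coarse_3d_center(box2d, K, prior_height):
    """Estimate a coarse 3D center in camera coordinates.

    box2d        : (x1, y1, x2, y2) in pixels
    K            : 3x3 camera intrinsic matrix
    prior_height : assumed real-world object height in meters
                   (e.g. a class average from the training set)
    """
    x1, y1, x2, y2 = box2d
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    h_px = y2 - y1
    if h_px <= 0:
        raise ValueError("2D box must have positive height")

    # Assumption: the 2D box height approximates the projected object height,
    # so similar triangles give depth = focal_length * real_height / pixel_height.
    z = fy * prior_height / h_px

    # Back-project the 2D box center to 3D at the estimated depth.
    u = 0.5 * (x1 + x2)
    v = 0.5 * (y1 + y2)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Hypothetical intrinsics and box, chosen only for illustration.
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
center = coarse_3d_center((550.0, 130.0, 650.0, 230.0), K, prior_height=1.5)
```

The estimate is only as good as the height prior and the tightness of the 2D box, which is why a refinement stage follows.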
The depth estimation method is practical. The quality-aware loss is also easier to implement than IoU-Net for predicting the quality of bboxes. However, the usefulness of surface feature extraction is doubtful.
That the paper still uses Caffe in 2019 is a bit of a shocker.
3D refinement uses the multibin loss proposed by Deep3DBox in place of direct regression. The bin width is the standard deviation of the error measured on the training dataset. Sigmoid is used instead of softmax to handle classification of background (BG) cases.
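The role of the sigmoid is easier to see in code. With softmax the bin probabilities must sum to 1, so a background sample has no valid target; with per-bin sigmoids it can simply get an all-zero target. The sketch below is a minimal illustration, not the paper's implementation: the bin layout (bins centered at integer multiples of the error standard deviation `sigma`, symmetric around zero) and the helper names `multibin_targets` and `sigmoid_bce` are assumptions.

```python
import numpy as np

def multibin_targets(residual, sigma, num_bins, is_foreground):
    """Build a 0/1 target vector over residual bins.

    Assumed layout: bin k is centered at (k - num_bins // 2) * sigma,
    where sigma is the std of the residual error on the training set.
    """
    targets = np.zeros(num_bins, dtype=np.float64)
    if not is_foreground:
        # Background: every bin is a negative, which softmax cannot express.
        return targets
    half = num_bins // 2
    idx = int(np.round(residual / sigma)) + half
    idx = min(max(idx, 0), num_bins - 1)  # clamp out-of-range residuals
    targets[idx] = 1.0
    return targets

def sigmoid_bce(logits, targets):
    """Numerically stable per-bin sigmoid cross-entropy, averaged over bins."""
    logits = np.asarray(logits, dtype=np.float64)
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(np.mean(loss))

# Illustrative values only.
fg_targets = multibin_targets(residual=0.35, sigma=0.2, num_bins=5, is_foreground=True)
bg_targets = multibin_targets(residual=0.0, sigma=0.2, num_bins=5, is_foreground=False)
bg_loss = sigmoid_bce(np.zeros(5), bg_targets)
```

Here the foreground residual of 0.35 falls nearest to the bin centered at 2 × sigma, while the background sample pushes all bin scores toward zero.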