January 2020
tl;dr: Generate synthetic views (virtual camera/viewport) of the image to reduce the complexity required of neural networks for monocular 3D object detection (3D MOD).
The paper builds on the work of monoDIS. The main idea is that a standard network has to build distinct representations devoted to recognizing objects at specific depths, and there is little margin for generalization across different depths. As a result, the network's capacity has to be scaled up as a function of the depth range covered, and the training data has to be scaled up as well.
The paper proposes a patch-based distance prediction network, so that the network only has to learn a representation for distances/scales within a very limited range.
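To see why working on rescaled patches narrows the depth range the network has to handle, here is a minimal sketch under a plain pinhole-camera assumption. It is not the paper's implementation: the function names, the depth bins, and the choice of aligning each bin's near edge to a reference depth are all hypothetical and only illustrate the geometry.

```python
# Illustrative sketch only; not the paper's implementation.
# Assumes a pinhole camera: apparent height (px) = focal_px * height_m / depth_m.

def apparent_height_px(focal_px, height_m, depth_m):
    """Projected pixel height of an object of physical height `height_m`."""
    return focal_px * height_m / depth_m


def virtual_depth(depth_m, scale):
    """Depth at which an object appears to sit after its patch is resized by `scale`.

    Resizing by `scale` multiplies the apparent height by `scale`, which under
    the pinhole model is equivalent to dividing the depth by `scale`.
    """
    return depth_m / scale


def normalize_bins(depth_bins, ref_near_m):
    """For each (near, far) bin, return (scale, virtual_near, virtual_far)."""
    out = []
    for near, far in depth_bins:
        # Hypothetical choice: align the near edge of every bin to ref_near_m.
        scale = near / ref_near_m
        out.append((scale, virtual_depth(near, scale), virtual_depth(far, scale)))
    return out


# Hypothetical geometric depth bins, in metres.
bins = [(5.0, 10.0), (10.0, 20.0), (20.0, 40.0)]
for scale, v_near, v_far in normalize_bins(bins, ref_near_m=5.0):
    print(f"resize x{scale:.0f} -> virtual depth range [{v_near:.0f}, {v_far:.0f}] m")
```

With these assumed bins, every patch ends up covering the same virtual depth range after resizing, so a single narrow-range predictor could in principle serve all of them; the metric depth is recovered by multiplying the predicted virtual depth by the patch's scale.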
This is a classical tradeoff between model/data complexity and inference complexity. If the image has an inherent structure (in autonomous driving camera images, closer objects appear at the bottom of the image and objects further away appear higher up), it can be exploited using row-aware or depth-aware convolution (cf. M3D-RPN). In this paper, the authors build a row-wise image pyramid of the original image.
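The row/depth structure mentioned above can be made concrete with a small sketch. It assumes a flat ground plane and a level pinhole camera, which is a simplification and not something taken from the paper; the camera parameters and depth bins are hypothetical values chosen for illustration.

```python
# Illustrative sketch only; assumes a flat ground plane and a level pinhole camera.

def ground_row(depth_m, focal_px, cam_height_m, cy_px):
    """Image row (px, measured downward) where a ground point at `depth_m` projects."""
    return cy_px + focal_px * cam_height_m / depth_m


def row_bands(depth_bins, focal_px, cam_height_m, cy_px):
    """Map each (near, far) depth bin to the (top_row, bottom_row) band
    in which ground contact points of that bin project."""
    bands = []
    for near, far in depth_bins:
        top = ground_row(far, focal_px, cam_height_m, cy_px)      # farther -> higher up
        bottom = ground_row(near, focal_px, cam_height_m, cy_px)  # closer -> lower down
        bands.append((top, bottom))
    return bands


# Hypothetical camera parameters and depth bins (metres).
bins = [(5.0, 10.0), (10.0, 20.0), (20.0, 40.0)]
for (near, far), (top, bottom) in zip(bins, row_bands(bins, 700.0, 1.5, 180.0)):
    print(f"depth {near:.0f}-{far:.0f} m -> rows {top:.2f} to {bottom:.2f}")
```

Under these assumptions, each depth bin maps to a horizontal band of the image, and farther bins occupy thinner bands nearer the horizon. This is the kind of structure a row-wise pyramid can exploit by processing each band at its own scale.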
The paper also has a good introduction to monocular 3D object detection.