September 2020
tl;dr: Use a self-supervised optical flow loss to pretrain a structure encoder for monocular depth estimation on uncalibrated videos.
Overall impression
The paper provides a new direction to improve monocular depth estimation. Conventional ImageNet pretraining helps classification tasks more than location-aware tasks such as object detection and depth estimation, because spatial information is discarded.
The core of the algorithm is still a photometric loss, so it carries that loss's inherent limitations.
One main difference between the optical flow and depth estimation tasks is that optical flow estimation does not depend on camera calibration.
The idea of encoding the motion between two frames into a latent vector is very similar to the PoseNet in SfM-learner and Struct2Depth, but more flexible.
The idea of using geometric pretraining to improve monoDepth or mono3D is similar to CubifAE-3D.
Key ideas
- Pretraining Architecture
- A conditional encoder-decoder to separate structure information and motion information
- The PoseNet takes in two consecutive images and outputs a 128-dim latent motion vector. --> This is similar to GeoNet, but instead of regressing every single variable in the pose vector, it encodes the motion (ego motion and other motions of the car) into a latent vector. This is supposed to be more versatile.
- The flow decoder predicts the optical flow and is trained with the self-supervised flow loss.
- Only the structure encoder is kept as the backbone. It learns general structures and does not explicitly estimate depth and pose.
- Then the backbone (structure encoder) is finetuned for depth estimation.
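As a rough illustration of what a self-supervised flow loss looks like, the sketch below backward-warps the second frame with a given flow field and measures the mean L1 difference to the first frame. It is a minimal pure-Python sketch, not the paper's implementation: the function names (`bilinear_sample`, `warp_with_flow`, `photometric_flow_loss`) are hypothetical, images are grayscale nested lists, and terms a real loss would likely include (SSIM, smoothness, occlusion handling) are omitted.

```python
def bilinear_sample(img, x, y):
    """Sample a grayscale image (list of rows) at float coords, clamping to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bot = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bot


def warp_with_flow(frame2, flow):
    """Backward warp: output pixel (x, y) reads frame2 at (x + u, y + v)."""
    h, w = len(frame2), len(frame2[0])
    return [
        [bilinear_sample(frame2, x + flow[y][x][0], y + flow[y][x][1]) for x in range(w)]
        for y in range(h)
    ]


def photometric_flow_loss(frame1, frame2, flow):
    """Mean L1 difference between frame1 and the flow-warped frame2 (assumed loss form)."""
    warped = warp_with_flow(frame2, flow)
    h, w = len(frame1), len(frame1[0])
    total = sum(abs(frame1[y][x] - warped[y][x]) for y in range(h) for x in range(w))
    return total / (h * w)


# Toy example: a horizontal intensity ramp, and the same ramp shifted right by one pixel.
frame1 = [[float(x) for x in range(4)] for _ in range(3)]
frame2 = [[float(x - 1) for x in range(4)] for _ in range(3)]
zero_flow = [[(0.0, 0.0)] * 4 for _ in range(3)]
true_flow = [[(1.0, 0.0)] * 4 for _ in range(3)]

loss_zero = photometric_flow_loss(frame1, frame2, zero_flow)
loss_true = photometric_flow_loss(frame1, frame2, true_flow)
```

In this toy case the correct flow gives a lower loss than the zero flow; the remaining error comes from border clamping in the last column. Note that nothing in the loss refers to camera intrinsics or pose, only to pixel correspondence.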
Technical details
- The reason to use optical flow as the pretraining task is that optical flow only cares about pixel correspondence and does not depend on extrinsics/intrinsics, so uncalibrated videos can be used for this pretraining.
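To make the calibration point concrete, compare the two warping rules, written here in the standard SfM-learner-style notation (the symbols are assumptions for illustration, not taken from the paper). Depth-and-pose view synthesis maps a target pixel $p_t$ to the source view through the intrinsics $K$, the predicted depth $D_t$, and the relative pose $(R, \mathbf{t})$:

$$
p_s \sim K \left( R \, D_t(p_t) \, K^{-1} p_t + \mathbf{t} \right)
$$

Flow-based warping only needs the predicted 2D displacement $F_{t \to s}$:

$$
p_s = p_t + F_{t \to s}(p_t)
$$

$K$ appears in the first mapping but not in the second, which is why the flow-based photometric loss can be evaluated on videos with unknown intrinsics.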
Notes