May 2020
tl;dr: Use a pretrained semantic segmentation model to guide PackNet; use two rounds of training to solve the infinite depth issue.
Overall impression
This paper has a different focus than previous methods, which focus on accurate VO. This paper focuses on accurate depth estimation, especially for dynamic objects like cars.
Under the framework of SfM-Learner and SfM, cars moving at the same speed as the ego car are projected to infinite depth. A simple way to avoid the infinite depth issue is to mask out all dynamic objects and train SfM only on static regions, but then car pixels receive no photometric supervision, and the depth on cars will be inaccurate during inference.
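To see why depth diverges, here is a minimal sketch of the view-synthesis geometry in standard pinhole notation (generic symbols, not taken from the paper). With intrinsics $K$, relative ego-pose $(R, t)$, and predicted depth $d(p)$, a target pixel $p$ (homogeneous coordinates $\bar{p}$) is warped into the source frame as

$$\hat{p} \sim K \left( d(p) \, R \, K^{-1} \bar{p} + t \right)$$

An object moving at the same velocity as the ego car appears stationary across frames, so the photometric loss is minimized when $\hat{p} = p$. For roughly straight driving ($R \approx I$) the warp reduces to $\hat{p} \sim \bar{p} + K t / d(p)$, which converges to $p$ exactly as $d(p) \to \infty$, so the loss actively rewards predicting infinite depth on such objects.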
So the best way is to train SfM on static scenes (e.g., in a parking lot) where cars are present but not moving; during inference the depth network generalizes to moving cars, as it only takes in a single image.
The infinite depth issue is also tackled in Struct2depth.
Key ideas
- The paper uses a pretrained semantic segmentation network and pixel-adaptive convolutions to perform content-adaptive upsampling.
- Two-pass training. The first pass trains a model that still suffers from the infinite depth issue; this model is then used to resample the dataset by automatically filtering out sequences whose infinite-depth predictions violate a basic geometric prior: if the heights of some pixels fall significantly below the ground, the image is filtered out (a sketch of this check appears after the list). This removes only about 5% of the training data.
- Static images also need to be filtered out, similar to SfM-Learner (also sketched after the list).
- Pixel-adaptive convolutions help preserve spatial details, similar to PackNet's packing/unpacking, and the two can be used together; a minimal sketch follows.
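A minimal PyTorch sketch of a pixel-adaptive convolution in the spirit of Su et al.'s PAC (CVPR 2019): an ordinary convolution whose kernel is reweighted per pixel by a fixed Gaussian affinity over guidance features. This is a simplified stand-in for illustration, not the paper's implementation; the function name and the Gaussian kernel choice are assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_adaptive_conv(x, guide, weight, bias=None):
    """Pixel-adaptive convolution: a standard conv whose kernel is
    reweighted at every pixel by a Gaussian affinity between that pixel's
    guidance feature and those of its k x k neighbors.

    x:      (B, C_in, H, W)  input features
    guide:  (B, C_g,  H, W)  guidance features (e.g., semantic features)
    weight: (C_out, C_in, k, k) shared conv weights (k odd)
    """
    B, C_in, H, W = x.shape
    C_out, _, k, _ = weight.shape
    pad = k // 2

    # Gather the k*k neighborhood of every pixel for input and guidance.
    x_unf = F.unfold(x, k, padding=pad).view(B, C_in, k * k, H, W)
    g_unf = F.unfold(guide, k, padding=pad).view(B, guide.size(1), k * k, H, W)

    # Gaussian affinity of each neighbor to the center pixel.
    affinity = torch.exp(-0.5 * ((g_unf - guide.unsqueeze(2)) ** 2).sum(1))

    # Reweight neighbors, then apply the shared weights in one matmul.
    x_w = (x_unf * affinity.unsqueeze(1)).reshape(B, C_in * k * k, H, W)
    out = torch.einsum('oc,bchw->bohw', weight.view(C_out, -1), x_w)
    return out if bias is None else out + bias.view(1, -1, 1, 1)
```

When all affinities equal 1 (uniform guidance), this reduces exactly to a standard convolution, which is why PAC can drop into an existing decoder.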
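For the geometric prior used in the two-pass filtering, a rough NumPy sketch of the idea: back-project each pixel with the predicted depth and flag frames where too many points land well below the ground plane. The camera height, margin, and fraction thresholds here are illustrative guesses, not values from the paper.

```python
import numpy as np

def violates_ground_prior(depth, K, cam_height=1.5, margin=0.5, frac=0.01):
    """Flag frames whose predicted depth pushes pixels far below the ground.

    Back-projects every pixel into camera coordinates (y pointing down) and
    counts how many end up more than `margin` meters below the ground plane,
    which sits `cam_height` meters under the camera. Frames corrupted by
    near-infinite depth on cars tend to fail this test.
    """
    h, w = depth.shape
    v = np.arange(h)[:, None]            # pixel rows
    # Vertical camera coordinate: y = (v - cy) * z / fy.
    y = (v - K[1, 2]) * depth / K[1, 1]
    below_ground = y > cam_height + margin
    return below_ground.mean() > frac
```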
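And a crude proxy for the static-image filter, assuming that near-identical consecutive frames carry no parallax signal; the exact criterion used by SfM-Learner's preprocessing may differ, and the threshold is made up.

```python
import numpy as np

def is_static_pair(img_a, img_b, thresh=0.01):
    """Drop near-identical consecutive frames: without camera motion there
    is no parallax, so the pair cannot supervise depth. Inputs are assumed
    uint8 images; `thresh` is on normalized [0, 1] intensities.
    """
    a = img_a.astype(np.float32) / 255.0
    b = img_b.astype(np.float32) / 255.0
    return float(np.abs(a - b).mean()) < thresh
```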
Technical details
- About 40000 images for training.
- Some networks exploit semantic information by predicting both depth and semantics from the same network. Towards Scene Understanding uses a conditional decoder to predict either depth or semantic segmentation, selecting the task by concatenating one condition channel to the jointly learned features (sketched below).
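A hypothetical PyTorch sketch of such a conditional decoder: a constant extra channel (0 for depth, 1 for semantics) is concatenated to the shared features, and the decoder is run once per task. The layer sizes and the split into two output heads are my assumptions, not details from Towards Scene Understanding.

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Decoder conditioned on a task flag: the same trunk serves depth and
    semantic segmentation, selected by a constant channel concatenated to
    the shared encoder features."""
    def __init__(self, feat_ch=256, num_classes=19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(feat_ch + 1, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Separate heads are an assumption; the original decoder may share
        # its output layer across tasks as well.
        self.depth_head = nn.Conv2d(64, 1, 3, padding=1)
        self.seg_head = nn.Conv2d(64, num_classes, 3, padding=1)

    def forward(self, feats, task):
        b, _, h, w = feats.shape
        # Condition channel: 0.0 selects depth, 1.0 selects semantics.
        cond = torch.full((b, 1, h, w), float(task == 'seg'),
                          device=feats.device, dtype=feats.dtype)
        x = self.trunk(torch.cat([feats, cond], dim=1))
        return self.depth_head(x) if task == 'depth' else self.seg_head(x)
```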
Notes