January 2021
tl;dr: predict BEV semantic maps from a single monocular video.
Overall impression
The previous SOTA PyrOccNet and Lift-Splat-Shoot study how to combine synchronized images from multiple cameras into a coherent 360-deg BEV map. BEV-feat-stitching instead tries to stitch monocular video frames into a coherent BEV map. This process also requires knowledge of the camera pose sequence.
The supervision of the intermediate feature map resembles that of feature-metric monodepth and the feature-metric distance in 3DSSD.
To be honest, the results do not look as clean as those of PyrOccNet. Future work may combine the two trends from BEV-feat-stitching and PyrOccNet.
This paper has a follow-up work, STSU, for structured BEV perception.
Key ideas
- Takes monocular video as input.
- BEV temporal aggregation module (see the sketch after this list):
  - Projects image features to BEV space.
  - BEV aggregation (BEV feature stitching) with camera pose.
  - Aggregation is done in a unified BEV grid (extended BEV).
- Intermediate feature supervision in camera space with reprojected BEV GT:
  - Single frame: object supervision.
  - Multiple frames: static classes.
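To make the stitching step concrete, below is a minimal NumPy sketch (not the authors' implementation; the grid size, pose convention, and tensor shapes are assumptions) that warps per-frame BEV feature maps into a unified extended BEV grid using the ego pose sequence and averages overlapping cells.

```python
# Minimal sketch of BEV feature stitching (illustrative only).
# Assumes per-frame BEV feature maps of shape (C, H, W) on a 0.25 m/pixel grid,
# and 2D ego poses (x, y, yaw) expressed in a common world frame.
import numpy as np


def warp_to_extended_bev(bev_feats, poses, res=0.25, ext_size=400):
    """Accumulate per-frame BEV features into one extended BEV grid."""
    C = bev_feats[0].shape[0]
    acc = np.zeros((C, ext_size, ext_size), dtype=np.float32)
    cnt = np.zeros((ext_size, ext_size), dtype=np.float32)

    # Pixel-center coordinates of the extended grid, in metres,
    # with the grid origin placed at the centre of the extended map.
    xs = (np.arange(ext_size) - ext_size / 2) * res
    gx, gy = np.meshgrid(xs, xs, indexing="xy")

    for feat, (px, py, yaw) in zip(bev_feats, poses):
        H, W = feat.shape[1:]
        # Transform extended-grid coordinates into this frame's ego frame.
        dx, dy = gx - px, gy - py
        c, s = np.cos(-yaw), np.sin(-yaw)
        ex, ey = c * dx - s * dy, s * dx + c * dy
        # Nearest-neighbour lookup into the frame's own BEV grid.
        u = np.round(ex / res + W / 2).astype(int)
        v = np.round(ey / res + H / 2).astype(int)
        valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        acc[:, valid] += feat[:, v[valid], u[valid]]
        cnt[valid] += 1.0

    # Average where several frames observed the same cell.
    return acc / np.clip(cnt, 1.0, None)
```

In the paper the aggregation is learned end to end; the hand-written nearest-neighbour warp and plain averaging above only illustrate the geometry of pooling per-frame BEV features into one extended grid.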
Technical details
- BEV map of 200 x 200 pixels at 0.25 m/pixel, covering 50 m x 50 m.
- The addition of dynamic classes helps with the static classes.
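The grid arithmetic works out to 200 px x 0.25 m/px = 50 m per side. Below is a tiny hypothetical helper for converting ego-frame metres to BEV pixel indices under these settings (the ego-at-centre convention is an assumption, not from the paper).

```python
# BEV grid arithmetic: 200 px * 0.25 m/px = 50 m per side.
BEV_SIZE_PX = 200        # output map is 200 x 200 pixels
BEV_RES_M = 0.25         # metres per pixel
BEV_EXTENT_M = BEV_SIZE_PX * BEV_RES_M   # = 50.0 m


def metric_to_pixel(x_m: float, y_m: float):
    """Map ego-frame metres to BEV pixel indices.

    Hypothetical convention: ego at the grid centre, x to the right,
    y forward, rows increasing downwards in the map.
    """
    col = int(x_m / BEV_RES_M + BEV_SIZE_PX / 2)
    row = int(BEV_SIZE_PX / 2 - y_m / BEV_RES_M)
    return row, col


# Example: a point 10 m ahead and 5 m to the right of the ego vehicle.
print(metric_to_pixel(5.0, 10.0))  # (60, 120)
```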
Notes
- The evaluation is still in mIoU, treating the problem as semantic segmentation. Perhaps instance segmentation should be introduced for better prediction and planning.
- Stitching may suffer from noise in the extrinsics and pose estimation, and deep learning helps smooth this out.