Deep Structured Crosswalk: End-to-End Deep Structured Models for Drawing Crosswalks

August 2020

tl;dr: Extract structured crosswalks from BEV images.

Overall impression

There are several works from Uber ATG that extract polyline representations from BEV maps.

This work predicts deep feature maps and then uses energy maximization to perform inference, so it is not exactly end to end.

Deep Structured Crosswalk can be directly applied to extract road boundaries. Deep Boundary Extractor is inspired by Deep Structured Crosswalk and uses conv-Snake to predict in an autoregressive fashion.

In a sense, this work is basically edge-aware semantic segmentation. The structured prediction module converts the unstructured semantic segmentation results into a structured representation.

Key ideas

Input: lidar + BEV camera, 4 ch. Road centerlines from OSM (OpenStreetMap) are needed for both prediction and dataset generation.
Output (deep features):
- Inverse distance transform (DT, 1 ch)
- Semantic segmentation (1 ch)
- Predicted alignment angle (dilated normals, 2 ch). This is improved to a direction map in Deep Boundary Extractor.
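To make the inverse distance transform output concrete, here is a minimal sketch of how such a target could be built from a binary boundary mask. The function name `inverse_distance_transform`, the clipping threshold `max_dist`, and the linear "threshold minus clipped distance" form are assumptions for illustration; the notes do not record the exact definition used in the paper.

```python
import numpy as np

def inverse_distance_transform(boundary_mask, max_dist=8.0):
    """Map that equals max_dist on boundary pixels and decays linearly to 0.

    Hypothetical helper: the clipping threshold and linear form are assumptions.
    """
    h, w = boundary_mask.shape
    ys, xs = np.nonzero(boundary_mask)
    if len(ys) == 0:
        return np.zeros((h, w), dtype=np.float64)
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    # Brute-force squared Euclidean distance from every pixel to every
    # boundary pixel; fine for small patches, too slow for large maps.
    d2 = (grid_y[..., None] - ys) ** 2 + (grid_x[..., None] - xs) ** 2
    dist = np.sqrt(d2.min(axis=-1))
    # Clip and invert so the target is dense and peaks on the boundary.
    return max_dist - np.minimum(dist, max_dist)

# Toy example: a single vertical boundary at column 3.
mask = np.zeros((5, 7), dtype=bool)
mask[:, 3] = True
inv = inverse_distance_transform(mask, max_dist=3.0)
```

Here every pixel within 3 pixels of the boundary carries a nonzero training signal, which is the densifying effect; for real map tiles a linear-time distance transform from an image library would replace the brute-force step.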
Inference with the structured prediction module.
- Finds the best two boundaries x1 and x2 along with the best angle β by maximizing a structured energy function.
- Draws two oriented, parallel lines so that they tightly enclose the maximum number of segmented points.
- It also encourages drawing the lines along the DT boundary.
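The inference step above can be sketched as an exhaustive search over (x1, x2, β). This is a simplified illustration, not the paper's exact energy: the function `best_crosswalk`, the weight `lam`, the remapping of segmentation scores to [-1, 1], and treating β as the angle of the boundary normal are all assumptions.

```python
import numpy as np

def best_crosswalk(seg, inv_dt, angles_deg, lam=1.0):
    """Exhaustively search (x1, x2, beta); returns (energy, x1, x2, beta_deg).

    Hypothetical simplified energy: segmentation mass enclosed by the two
    lines plus lam times the inverse-DT mass lying on the two lines.
    """
    h, w = seg.shape
    ys, xs = np.mgrid[0:h, 0:w]
    score = 2.0 * seg - 1.0  # foreground rewards, background penalizes
    best = (-np.inf, None, None, None)
    for beta in angles_deg:
        t = np.deg2rad(beta)
        # Offset of every pixel along the unit normal (cos t, sin t),
        # rounded to an integer bin and shifted so bins start at 0.
        proj = np.rint(xs * np.cos(t) + ys * np.sin(t)).astype(int)
        proj -= proj.min()
        n_bins = int(proj.max()) + 1
        # One accumulator bin per line parallel to the candidate boundaries.
        seg_bins = np.bincount(proj.ravel(), weights=score.ravel(), minlength=n_bins)
        dt_bins = np.bincount(proj.ravel(), weights=inv_dt.ravel(), minlength=n_bins)
        for x1 in range(n_bins):
            for x2 in range(x1 + 1, n_bins):
                energy = seg_bins[x1:x2 + 1].sum() + lam * (dt_bins[x1] + dt_bins[x2])
                if energy > best[0]:
                    best = (energy, x1, x2, beta)
    return best

# Toy example: a vertical crosswalk band covering columns 5..12.
seg = np.zeros((20, 20))
seg[:, 5:13] = 1.0
inv_dt = np.zeros((20, 20))
inv_dt[:, 5] = 1.0
inv_dt[:, 12] = 1.0
energy, x1, x2, beta = best_crosswalk(seg, inv_dt, angles_deg=[0, 45, 90])
```

On this toy input the search recovers the band edges at offsets 5 and 12 with β = 0. Mapping background to negative scores is what makes the lines enclose the segmentation tightly: widening the band past the crosswalk lowers the energy.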

Technical details

96% accuracy.
4 cm/pixel resolution
Distance transforms are a natural way to “blur” feature locations geometrically (source). This is another way to densify sparse GT, as compared to Gaussian blurring as in CenterNet.
Lidar helps, and multiple passes help. Multi-pass BEV is better than single-pass BEV + lidar.
Ablation study
- The alignment angle helps quite a bit; without it there is a significant drop in performance. It is also predicted quite accurately, as substituting the predicted angle with the oracle (GT) angle does not lead to a performance increase.
- Semantic segmentation alone is a strong baseline, and already achieves quite good results.
GT noise: They estimate the noise in human annotation of the ground truth by having several annotators label 100 intersections. The result is about 5% error in IoU.
Uses integral accumulators (integral images), so the energy calculations can be optimized to avoid exhaustive evaluation.
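As a reminder of why integral accumulators help, here is a minimal sketch of a standard 2D integral image (summed-area table). After one cumulative-sum pass, the sum over any axis-aligned box costs four lookups instead of a loop over its pixels. The helper names `integral_image` and `box_sum` are illustrative, and how exactly the paper applies the accumulators to its rotated feature maps is not covered in these notes.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) using four table lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# Toy example: compare the O(1) box sum against a direct slice sum.
img = np.arange(12, dtype=np.float64).reshape(3, 4)
ii = integral_image(img)
fast = box_sum(ii, 1, 1, 3, 3)
slow = img[1:3, 1:3].sum()
```

With such a table, scoring every candidate pair of boundaries for a given angle no longer requires re-summing the enclosed pixels each time.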

Notes

What is the image size (HxW) of each patch?
This crosswalk extraction task can be easily extended to road boundary extraction and stopline extraction.