May 2020
tl;dr: 3D structure > SIFT + 5pt solver > Neural network based.
Overall impression
The paper builds on Understanding APR and demonstrates that the good old SIFT + 5-pt solver is still the state of the art for relative pose estimation without 3D structure. 3D-structure-based methods can achieve better results but require scene-specific 3D modeling and lack generalization.
Relative localization has three steps: feature extraction, feature matching, and pose estimation (essential matrix, or R and t); a sketch of the classical pipeline follows the list below.
- Feature extraction (SIFT or DL)
- Matching (concat or DL neighborhood consensus matching layer)
- Solve for pose (5 pt solver, DL regression)
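For reference, a minimal sketch of the classical SIFT + 5-pt pipeline using OpenCV (function names are OpenCV's; the intrinsic matrix `K` and the ratio-test threshold are assumed, not from the paper):

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Classical relative pose: SIFT features, ratio-test matching, 5-pt solver."""
    # 1. Feature extraction (SIFT)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # 2. Matching (brute force + Lowe's ratio test)
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # 3. Solve for pose (5-pt solver inside RANSAC), then decompose E into R, t
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # t is only recovered up to scale
```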
The bottleneck of the DL-based approach is the matching and pose-regression part. DL pose regression cannot generalize to new scenes, as the regression network cannot properly learn implicit matching.
However, even if we replace the pose regression with the 5-pt solver, it still cannot beat SIFT + 5-pt solver. This is mainly because current CNN features are coarsely localized on the image, that is, features from later layers map not to a single pixel but to an image patch. Self-supervised keypoint-learning based methods still cannot beat SIFT consistently. I wrote a blog about self-supervised keypoint learning here. As pointed out in an open review for KP2D, “the problem is old yet not fully solved yet, because handcrafted SIFT is still winning the benchmarks.”
Key ideas
- Direct approach:
- Current SOTA of visual localization is based on 3D information. The representations are scene-specific and do not generalize to unseen scenes.
- Indirect approach:
- A more flexible way is image retrieval followed by relative pose estimation. This involves building a database of posed images and is more scalable and generalizable. The image retrieval can be done efficiently with compact image-level descriptors (such as DenseVLAD, CVPR 2015).
- Regressing R and t directly requires finetuning hyperparameters to balance the two loss terms for different scenes (outdoor vs indoor). Regressing the essential matrix naturally handles this issue: it is a natural way to weigh the loss terms according to the final metric. (cf. SMOKE)
- This perhaps can be extended: still regress the quaternion and translation vector, but construct an essential matrix from them and compare it with the ground truth. This way we don’t have to worry about projecting the predicted matrix onto the essential-matrix manifold afterwards (see the sketch below).
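A rough sketch of this idea (my own assumption, not the paper's method): predict a quaternion q and translation t, assemble $E = [t]_\times R$, and compare it with the ground-truth essential matrix up to scale and sign.

```python
import numpy as np

def quat_to_rot(q):
    # Unit quaternion (w, x, y, z) to 3x3 rotation matrix.
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def skew(t):
    # Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v).
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def essential_from_qt(q, t):
    # E = [t]_x R is a valid essential matrix by construction,
    # so no post-hoc projection onto the manifold is needed.
    return skew(t / np.linalg.norm(t)) @ quat_to_rot(q)

def essential_loss(q_pred, t_pred, E_gt):
    # Compare predicted and GT essential matrices up to scale and sign
    # (E and -E encode the same epipolar geometry).
    E_pred = essential_from_qt(q_pred, t_pred)
    E_gt = E_gt / np.linalg.norm(E_gt)
    E_pred = E_pred / np.linalg.norm(E_pred)
    return min(np.linalg.norm(E_pred - E_gt), np.linalg.norm(E_pred + E_gt))
```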
Technical details
- Interpolating pose based on nearest neighbors does not give the best results. The retrieved images have to be a minimum distance apart to avoid too much correlation. For outdoor scenes, the paper starts with the top-ranked images and picks images that are within [3, 50] meters of all previously selected images.
- Autonomous driving is largely planar motion and should be easier than full-blown 6-DoF localization.
- The essential matrix is regressed with an FC layer and may not be valid. The regressed E matrix is then projected onto the space (manifold) of valid essential matrices by replacing the first two singular values with their mean (or simply 1, since E is defined only up to scale) and setting the smallest singular value to 0: $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3) \rightarrow \mathrm{diag}(1, 1, 0)$. See the sketch after this list.
- Absolute pose estimation is scene specific as it depends on the coordinate system used.
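A minimal sketch of the SVD-based projection described above (my own numpy implementation of that step, not the paper's code):

```python
import numpy as np

def project_to_essential(E_raw):
    """Project an arbitrary 3x3 matrix onto the essential-matrix manifold."""
    U, _, Vt = np.linalg.svd(E_raw)
    # A valid essential matrix has two equal singular values and one zero;
    # the overall scale is arbitrary, so diag(1, 1, 0) is used.
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Usage: clean up a raw network prediction before decomposing it into R, t.
E = project_to_essential(np.random.randn(3, 3))
```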
Notes