Do We Really Need Scene-specific Pose Encoders?
Yoli Shavit, Ron Ferens

TL;DR
This paper demonstrates that scene-specific pose encoders are unnecessary for visual pose regression, showing that generic image similarity encodings can achieve competitive or superior localization results, including consistent outdoor localization within 2 meters and 5 degrees.
Contribution
The paper introduces a scene-agnostic approach that regresses the camera pose from pre-computed image similarity encodings, challenging the necessity of scene-specific encoders.
Findings
Encodings from image retrieval models suffice for accurate pose regression.
The proposed method can outperform state-of-the-art scene-specific models.
The method consistently achieves outdoor localization within 2 meters and 5 degrees.
Abstract
Visual pose regression models estimate the camera pose from a query image with a single forward pass. Current models learn pose encoding from an image using deep convolutional networks which are trained per scene. The resulting encoding is typically passed to a multi-layer perceptron in order to regress the pose. In this work, we propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead. In order to test our hypothesis, we take a shallow architecture of several fully connected layers and train it with pre-computed encodings from a generic image retrieval model. We find that these encodings are not only sufficient to regress the camera pose, but that, when provided to a branching fully connected architecture, a trained model can achieve competitive results and even surpass current…
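The branching fully connected design described in the abstract can be illustrated with a minimal forward-pass sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the layer sizes, the single shared layer, the split into a position branch and a unit-quaternion orientation branch, and the names `init_model` and `regress_pose`. The input vector is a random stand-in for a pre-computed encoding from an image retrieval model, and training is omitted.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    """Small random weights and biases for an n_in -> n_out layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return weights, bias


def init_model(rng, enc_dim, hidden_dim):
    """Hypothetical layout: one shared layer, then two separate branches."""
    return {
        "shared": init_layer(rng, enc_dim, hidden_dim),
        "pos_hidden": init_layer(rng, hidden_dim, hidden_dim),
        "pos_out": init_layer(rng, hidden_dim, 3),   # x, y, z
        "rot_hidden": init_layer(rng, hidden_dim, hidden_dim),
        "rot_out": init_layer(rng, hidden_dim, 4),   # quaternion (assumed)
    }


def regress_pose(model, encoding):
    """Map a fixed, pre-computed encoding to (position, unit quaternion)."""
    shared = relu(linear(encoding, *model["shared"]))
    position = linear(relu(linear(shared, *model["pos_hidden"])), *model["pos_out"])
    quat = linear(relu(linear(shared, *model["rot_hidden"])), *model["rot_out"])
    # Normalize so the orientation is a valid unit quaternion; guard against zero.
    norm = math.sqrt(sum(q * q for q in quat)) or 1.0
    return position, [q / norm for q in quat]


rng = random.Random(0)
model = init_model(rng, enc_dim=16, hidden_dim=8)
# Random stand-in for an encoding produced offline by an image retrieval model.
encoding = [rng.gauss(0.0, 1.0) for _ in range(16)]
position, orientation = regress_pose(model, encoding)
```

The point of the sketch is that no convolutional encoder appears anywhere in the regressor: the image encoder is run once, offline, and only the shallow fully connected layers would be trained per scene. In practice the encoding would be much larger than 16 values and the layers would be built and trained in a deep learning framework rather than in plain Python.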
