Learning Multi-Scene Absolute Pose Regression with Transformers
Yoli Shavit, Ron Ferens, Yosi Keller

TL;DR
This paper introduces a Transformer-based approach for multi-scene absolute camera pose regression, enabling the model to effectively localize across multiple environments using self-attention mechanisms.
Contribution
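The work presents a Transformer architecture for multi-scene absolute pose regression. Encoders aggregate activation maps with self-attention, and decoders turn latent features and scene encodings into candidate pose predictions, so the model can focus on general features that are informative for localization while embedding multiple scenes in parallel.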
Findings
Outperforms existing multi-scene pose regressors
Surpasses state-of-the-art single-scene methods
Effective on indoor and outdoor datasets
Abstract
Absolute camera pose regressors estimate the position and orientation of a camera from the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended for learning multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into candidate pose predictions. This mechanism allows our model to focus on general features that are informative for localization while embedding multiple scenes in parallel. We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it…
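The encoder and decoder roles described in the abstract can be illustrated with a minimal pure-Python sketch of scaled dot-product attention. It is not the authors' implementation: the token values, the scene queries, and the helper names `softmax` and `attend` are hypothetical, learned projections and the pose regression head are omitted, and identity projections are assumed throughout.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention.

    Each query receives a weighted average of `values`, with weights given
    by the softmax of its scaled dot products with `keys`.
    """
    d = len(keys[0])
    width = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(width)])
    return out


# Hypothetical 2x2 activation map with 3 channels, flattened to 4 tokens.
tokens = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# Encoder role: self-attention aggregates information across map positions.
encoded = attend(tokens, tokens, tokens)

# Decoder role: one (hypothetical) query per scene cross-attends to the
# encoded tokens, giving one candidate latent vector per scene.
scene_queries = [
    [1.0, 0.0, 0.0],  # assumed encoding for scene 0
    [0.0, 0.0, 1.0],  # assumed encoding for scene 1
]
candidates = attend(scene_queries, encoded, encoded)

print(len(encoded), len(candidates))
```

In a full model each candidate latent would be passed to a regression head that outputs a position and an orientation; that step is left out here because the abstract does not specify it.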
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Human Pose and Action Recognition
