Coarse-to-Fine Multi-Scene Pose Regression with Transformers
Yoli Shavit, Ron Ferens, Yosi Keller

TL;DR
This paper introduces a Transformer-based approach for multi-scene camera pose regression, enabling the model to focus on relevant features and outperform existing methods on standard benchmarks.
Contribution
The work presents a novel Transformer architecture with mixed classification-regression for multi-scene pose estimation, improving accuracy over prior models.
Findings
Outperforms state-of-the-art single-scene regressors
Effective multi-scene localization on benchmark datasets
Transformer-based architecture enhances feature focus
Abstract
Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach \cite{shavit2021learning} by introducing a mixed…
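The data flow described in the abstract (self-attention over activation-map tokens, then decoding with per-scene encodings into a pose) can be illustrated with a small sketch. This is not the authors' implementation: the single-head attention without learned projections, the per-scene query vectors, the scene-selection step, and every name here (`attend`, `predict_pose`, `scene_head`, `pos_head`, `ori_head`) are assumptions chosen for illustration, and all weights are random.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, tokens):
    """Single-head scaled dot-product attention (no learned projections)."""
    dim = len(tokens[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(dim) for t in tokens])
        outputs.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


def linear(weight_rows, x):
    """Bias-free linear layer: one output per weight row."""
    return [dot(row, x) for row in weight_rows]


def predict_pose(tokens, scene_queries, scene_head, pos_head, ori_head):
    # "Encoder" step: tokens attend to each other (self-attention).
    encoded = attend(tokens, tokens)
    # "Decoder" step: one hypothetical query per scene attends to the encoded tokens.
    latents = attend(scene_queries, encoded)
    # Assumed scene selection: score each latent and keep the most probable scene.
    scene_probs = softmax([dot(scene_head, z) for z in latents])
    scene = max(range(len(latents)), key=scene_probs.__getitem__)
    z = latents[scene]
    # Regress a 3-D position and a unit quaternion from the selected latent.
    position = linear(pos_head, z)
    q = linear(ori_head, z)
    norm = math.sqrt(dot(q, q)) or 1.0
    orientation = [v / norm for v in q]
    return scene, position, orientation


rng = random.Random(0)
dim, num_tokens, num_scenes = 8, 16, 4


def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


tokens = [rand_vec(dim) for _ in range(num_tokens)]        # flattened activation map
scene_queries = [rand_vec(dim) for _ in range(num_scenes)]  # stand-in scene encodings
scene_head = rand_vec(dim)
pos_head = [rand_vec(dim) for _ in range(3)]
ori_head = [rand_vec(dim) for _ in range(4)]

scene, position, orientation = predict_pose(
    tokens, scene_queries, scene_head, pos_head, ori_head
)
print("selected scene:", scene)
print("position:", position)
print("orientation (unit quaternion):", orientation)
```

In a trained model the attention layers would carry learned projections, and the scene scores and regression heads would be fit to pose labels; the sketch only shows how a single set of image features can serve several scenes through separate scene encodings.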
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Human Pose and Action Recognition
Methods: Focus
