MVTOP: Multi-View Transformer-based Object Pose-Estimation

Lukas Ranftl; Felix Brendel; Bertram Drost; Carsten Steger

arXiv:2508.03243·cs.CV·March 24, 2026

TL;DR

MVTOP is a transformer-based multi-view approach for rigid object pose estimation that resolves pose ambiguities through early fusion of view-specific features. It outperforms single-view and existing multi-view methods on a synthetic dataset and achieves competitive results on YCB-V.

Contribution

The paper introduces MVTOP, a novel multi-view transformer that represents the multi-view geometry via lines of sight emanating from the camera centers and resolves pose ambiguities that a single view, or post-processing of single-view poses, cannot.

Findings

01

Outperforms single-view and existing multi-view methods on the synthetic dataset.

02

Achieves competitive results on the YCB-V dataset.

03

Can reliably resolve pose ambiguities in multi-view scenarios.

Abstract

We present MVTOP, a novel transformer-based method for multi-view rigid object pose estimation. Through an early fusion of the view-specific features, our method can resolve pose ambiguities that would be impossible to solve with a single view or with a post-processing of single-view poses. MVTOP models the multi-view geometry via lines of sight that emanate from the respective camera centers. While the method assumes the camera interior and relative orientations are known for a particular scene, they can vary for each inference. This makes the method versatile. The use of the lines of sight enables MVTOP to predict the correct pose from the merged multi-view information. To show the model's capabilities, we provide a synthetic dataset that can only be solved with such holistic multi-view approaches since the poses in the dataset cannot be solved with just one view. Our…
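
To make the "lines of sight" idea concrete, the following is a minimal sketch of how a viewing ray through a pixel can be computed from known interior and exterior orientation. It uses a standard pinhole back-projection and is not taken from the paper: the function name `line_of_sight`, the skew-free intrinsics (`fx`, `fy`, `cx`, `cy`), and the world-to-camera convention `x_cam = R x_world + t` are all assumptions made for illustration. How MVTOP actually parameterizes the lines and feeds them to the transformer is not stated in this summary.

```python
import math

def line_of_sight(u, v, fx, fy, cx, cy, R, t):
    """Return (origin, direction) of the viewing ray through pixel (u, v).

    Hypothetical helper, assuming a skew-free pinhole camera and a
    world-to-camera pose x_cam = R @ x_world + t (R is 3x3, t has 3 entries).
    """
    # Ray direction in the camera frame: back-project the pixel to depth 1.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)

    # The inverse of a rotation matrix is its transpose.
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]

    # Rotate the direction into the world frame and normalize it.
    d_world = [sum(Rt[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    direction = [c / norm for c in d_world]

    # The camera center in world coordinates is C = -R^T t.
    origin = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return origin, direction
```

Because the intrinsics and pose are plain arguments, they can differ for every call, which mirrors the abstract's point that the camera orientations must be known per scene but may vary between inferences.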
