Object Pose Transformer: Unifying Unseen Object Pose Estimation
Weihang Li, Lorenzo Garattoni, Fabien Despinoy, Nassir Navab, Benjamin Busam

TL;DR
This paper introduces Object Pose Transformer, a unified model that handles both category-level absolute pose estimation and relative pose estimation for unseen objects, using multi-view geometric reasoning and contrastive embeddings with either RGB-only or RGB-D input.
Contribution
The proposed Object Pose Transformer unifies absolute and relative pose estimation paradigms within a single model, enabling versatile and accurate object pose predictions without relying on predefined taxonomies.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Effectively estimates both absolute and relative object poses.
Operates in RGB-only and RGB-D settings with camera-agnostic design.
Abstract
Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer, a unified feed-forward framework that bridges these paradigms through task factorization within a single model. Object Pose Transformer jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at…
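The abstract does not say how a pose is read out from the predicted NOCS and point maps. As a rough illustration only, one common way to turn per-pixel NOCS coordinates and matching camera-frame 3D points into an absolute pose is a least-squares similarity fit (Umeyama alignment). The sketch below assumes a uniform scale, which is a simplification, and the function names `umeyama_similarity` and `relative_pose` are hypothetical, not taken from the paper. The relative pose between two views is then composed from the two absolute poses, assuming both views share the same object scale.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares (s, R, t) such that dst ≈ s * R @ src + t.

    src: (N, 3) canonical (NOCS) coordinates.
    dst: (N, 3) matching camera-frame points (e.g. from a point map).
    """
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / n
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation.
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0

    R = U @ np.diag(d) @ Vt
    var_src = (src_c ** 2).sum() / n
    s = (S * d).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def relative_pose(R1, t1, R2, t2):
    """Rigid transform taking view-1 camera coordinates to view-2 camera
    coordinates, assuming both absolute poses share the same object scale."""
    R_rel = R2 @ R1.T
    t_rel = t2 - R_rel @ t1
    return R_rel, t_rel
```

With noise-free correspondences the fit recovers the generating scale, rotation, and translation up to numerical precision; with real network predictions a robust wrapper such as RANSAC would normally be placed around it.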
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Robotics and Sensor-Based Localization
