Object Pose Transformer: Unifying Unseen Object Pose Estimation

Weihang Li; Lorenzo Garattoni; Fabien Despinoy; Nassir Navab; Benjamin Busam

arXiv:2603.23370 · cs.CV · March 25, 2026

TL;DR

This paper introduces Object Pose Transformer, a unified model that handles both category-level absolute pose estimation and relative pose estimation for unseen objects. It uses multi-view geometric reasoning and contrastive embeddings, and works with RGB-only or RGB-D input.

Contribution

The proposed Object Pose Transformer unifies the absolute and relative pose estimation paradigms within a single model, predicting object poses without relying on predefined category taxonomies.

Findings

01

Achieves state-of-the-art results on multiple benchmarks.

02

Effectively estimates both absolute and relative object poses.

03

Operates in RGB-only and RGB-D settings with camera-agnostic design.

Abstract

Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer, a unified feed-forward framework that bridges these paradigms through task factorization within a single model. Object Pose Transformer jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at…
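To make the link between the two paradigms concrete, the following is a minimal sketch (not the authors' implementation) of how a relative SE(3) pose between two views follows from per-view absolute poses. It assumes each view provides a rigid object-to-camera transform as a 4x4 matrix and ignores any scale component of the category-level pose. All function and variable names (`se3_inverse`, `relative_pose`, `T_a`, `T_b`) are hypothetical and chosen for illustration only.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def se3_inverse(T):
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def rot_z(angle_rad, t):
    """Build a toy pose: rotation about the z axis plus translation t."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0, t[0]],
            [s, c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def relative_pose(T_obj_to_cam_a, T_obj_to_cam_b):
    """Transform that maps points from camera A's frame to camera B's frame."""
    return mat_mul(T_obj_to_cam_b, se3_inverse(T_obj_to_cam_a))

# Hypothetical absolute object-to-camera poses for two views of one object.
T_a = rot_z(math.radians(30), [0.1, 0.0, 1.0])
T_b = rot_z(math.radians(75), [0.0, 0.2, 1.5])
T_ab = relative_pose(T_a, T_b)
```

The converse does not hold: a relative transform such as `T_ab` alone does not determine `T_a` or `T_b`, which is the limitation of purely relative methods that the abstract points out.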

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Robotics and Sensor-Based Localization