You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation

Hakjin Lee; Junghoon Seo; Jaehoon Sim

arXiv:2508.14965 · cs.CV · March 11, 2026

TL;DR

YOPO is a minimalist, single-stage transformer-based framework that unifies object detection and 9-DoF pose estimation from RGB images at the category level, achieving state-of-the-art results without additional data.

Contribution

The paper introduces YOPO, an RGB-only, end-to-end, query-based method that unifies detection and pose estimation without relying on depth or CAD models.

Findings

01. Sets new state-of-the-art on three benchmarks.

02. Achieves 79.6% IoU50 and 54.1% under 10°10cm on REAL275.

03. Outperforms prior RGB-only methods.
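
The 10°10cm figure above counts a prediction as correct when its rotation error is below 10° and its translation error is below 10 cm. The sketch below shows one way to compute that check, assuming rotation matrices given as nested lists and translations in metres. The function names are hypothetical, the strict inequality is an assumption, and benchmark evaluation code also treats symmetric object categories specially, which this sketch ignores.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T @ R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between translations in metres, returned in cm."""
    return 100.0 * math.dist(t_pred, t_gt)


def within_10deg_10cm(R_pred, t_pred, R_gt, t_gt, max_deg=10.0, max_cm=10.0):
    """True when both the rotation and translation errors are under threshold."""
    return (rotation_error_deg(R_pred, R_gt) < max_deg
            and translation_error_cm(t_pred, t_gt) < max_cm)


def rot_z(deg):
    """Helper for the example: rotation about the z axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

For instance, `within_10deg_10cm(rot_z(5.0), (0.0, 0.0, 1.05), rot_z(0.0), (0.0, 0.0, 1.0))` is true (5° and 5 cm off), while a 20° rotation error or a 20 cm translation error fails the check. The reported percentage would then be the share of evaluated objects that pass it.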

Abstract

Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained…
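
To illustrate the idea of a pose-aware matching cost named in the abstract, here is a rough Python sketch. The abstract does not list the cost terms or their weights, so everything here is an assumption: the cost is taken to be the usual class and box terms plus a rotation (geodesic angle) term and a translation (Euclidean distance) term, all with weight 1. The names `pair_cost` and `match` are hypothetical. For brevity the sketch finds the optimal one-to-one assignment by exhaustive search rather than the Hungarian algorithm; both return a minimum-cost assignment, and at realistic query counts a solver such as `scipy.optimize.linear_sum_assignment` would be used instead.

```python
import math
from itertools import permutations


def pair_cost(pred, gt, w_cls=1.0, w_box=1.0, w_rot=1.0, w_trans=1.0):
    """Cost of assigning one prediction to one ground-truth object.

    All four terms and their weights are illustrative assumptions.
    """
    # Class term: higher probability for the true label gives lower cost.
    cls_cost = -pred["probs"][gt["label"]]
    # Box term: L1 distance between (cx, cy, w, h) boxes.
    box_cost = sum(abs(a - b) for a, b in zip(pred["box"], gt["box"]))
    # Rotation term: geodesic angle (radians) between rotation matrices.
    trace = sum(pred["R"][i][j] * gt["R"][i][j]
                for i in range(3) for j in range(3))
    rot_cost = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    # Translation term: Euclidean distance between 3D translations.
    trans_cost = math.dist(pred["t"], gt["t"])
    return (w_cls * cls_cost + w_box * box_cost
            + w_rot * rot_cost + w_trans * trans_cost)


def match(preds, gts):
    """Minimum-cost one-to-one assignment of predictions to ground truths.

    Exhaustive search, so only suitable for tiny examples.
    Requires len(preds) >= len(gts). Returns (pred_index, gt_index) pairs.
    """
    best_cost, best_perm = math.inf, ()
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(p, g) for g, p in enumerate(best_perm)]


# Toy example: class scores and boxes are identical across candidates,
# so only the pose terms can tell the candidates apart.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
BOX = (0.5, 0.5, 0.2, 0.2)
gts = [
    {"label": 0, "box": BOX, "R": IDENTITY, "t": (0.0, 0.0, 1.0)},
    {"label": 0, "box": BOX, "R": IDENTITY, "t": (0.0, 0.0, 2.0)},
]
preds = [
    {"probs": [0.9, 0.1], "box": BOX, "R": IDENTITY, "t": (0.0, 0.0, 2.1)},
    {"probs": [0.9, 0.1], "box": BOX, "R": IDENTITY, "t": (0.0, 0.0, 0.9)},
    {"probs": [0.9, 0.1], "box": BOX, "R": IDENTITY, "t": (1.0, 1.0, 5.0)},
]
assignment = match(preds, gts)
```

In this toy case a cost built only from class and box terms would see all three candidates as ties. With the pose terms included, prediction 1 is matched to the first object and prediction 0 to the second, and prediction 2 is left unmatched.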

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Advanced Neural Network Applications