You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation
Hakjin Lee, Junghoon Seo, Jaehoon Sim

TL;DR
YOPO is a minimalist, single-stage transformer-based framework that unifies object detection and 9-DoF pose estimation from RGB images at the category level, achieving state-of-the-art results without additional data.
Contribution
The paper introduces YOPO, an RGB-only, end-to-end, query-based method that unifies object detection and 9-DoF pose estimation without relying on depth or CAD models.
Findings
Sets a new state of the art on three benchmarks.
On REAL275, achieves 79.6% IoU50 and 54.1% under the 10°10cm metric.
Outperforms prior RGB-only methods.
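For readers unfamiliar with the 10°10cm metric: a predicted pose passes when its rotation error is at most 10 degrees and its translation error is at most 10 cm. The following is a minimal sketch of that per-prediction check, not the benchmark's evaluation code. The function names are hypothetical, translations are assumed to be in metres, and the sketch ignores any special handling of symmetric objects and any averaging that a benchmark protocol may apply.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres (assumes inputs are in metres)."""
    return 100.0 * math.dist(t_pred, t_gt)

def within_threshold(R_pred, t_pred, R_gt, t_gt, max_deg=10.0, max_cm=10.0):
    """True when both the rotation and the translation error are within bounds."""
    return (rotation_error_deg(R_pred, R_gt) <= max_deg
            and translation_error_cm(t_pred, t_gt) <= max_cm)

def rot_z(deg):
    """Helper for examples: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]
```

As an illustration, a prediction that is off by 5 degrees about the z-axis and 5 cm in depth would pass `within_threshold`, while one that is off by 20 degrees would not.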
Abstract
Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained…
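To make the idea of a pose-aware matching cost concrete, here is a small sketch of one-to-one matching between predicted queries and ground-truth objects, with rotation and translation terms added to the usual class and box terms. This is an illustration under stated assumptions, not the paper's formulation: the abstract does not give the cost terms or their weights, so `pair_cost`, the dictionary keys, and the unit weights are all hypothetical. For brevity the sketch finds the minimum-cost assignment by brute force over permutations rather than with the Hungarian algorithm; both return the same optimal assignment, but brute force is only practical for a handful of objects.

```python
import math
from itertools import permutations

def geodesic_rad(Ra, Rb):
    """Geodesic angle in radians between two 3x3 rotation matrices (nested lists)."""
    trace = sum(Ra[i][j] * Rb[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))

def pair_cost(pred, gt, w_cls=1.0, w_box=1.0, w_rot=1.0, w_trans=1.0):
    """Hypothetical cost of assigning one prediction to one ground-truth object."""
    cls_cost = 1.0 - pred["probs"][gt["label"]]            # low when the class is confident
    box_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))  # L1 on the 2D box
    rot_cost = geodesic_rad(pred["R"], gt["R"])            # rotation term
    trans_cost = math.dist(pred["t"], gt["t"])             # translation term
    return (w_cls * cls_cost + w_box * box_cost
            + w_rot * rot_cost + w_trans * trans_cost)

def match(preds, gts):
    """Return (pred_idx, gt_idx) pairs minimising the total cost.

    Assumes len(preds) >= len(gts); unmatched predictions are left out.
    Brute force stands in for the Hungarian algorithm here.
    """
    best_perm, best_cost = (), math.inf
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return [(p, g) for g, p in enumerate(best_perm)]

def rot_z(deg):
    """Helper for examples: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]
```

Because the pose terms enter the cost, a query whose 2D box fits an object but whose rotation or translation is far off is penalised at matching time, instead of being matched on 2D evidence alone.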
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Advanced Neural Network Applications
