OMNI-PoseX: A Fast Vision Model for 6D Object Pose Estimation in Embodied Tasks

Michael Zhang; Wei Ying; Fangwen Chen; Shifeng Bai; Hanwen Kang

arXiv:2604.02759·cs.RO·April 6, 2026

TL;DR

OMNI-PoseX is a fast, open-vocabulary vision model that achieves state-of-the-art 6D object pose estimation in real time, making it suitable for robotic applications in open-world environments.

Contribution

The paper introduces a novel architecture that combines open-vocabulary perception with SO(3)-aware pose prediction, improving generalization and stability.

Findings

01  Achieves state-of-the-art accuracy in 6D pose estimation

02  Demonstrates real-time performance in robotic grasping

03  Generalizes well to unseen objects in open-world settings

Abstract

Accurate 6D object pose estimation is a fundamental capability for embodied agents, yet remains highly challenging in open-world environments. Many existing methods rely on closed-set assumptions or geometry-agnostic regression schemes, limiting their generalization, stability, and real-time applicability in robotic systems. We present OMNI-PoseX, a vision foundation model that introduces a novel network architecture unifying open-vocabulary perception with an SO(3)-aware reflected flow matching pose predictor. The architecture decouples object-level understanding from geometry-consistent rotation inference, and employs a lightweight multi-modal fusion strategy that conditions rotation-sensitive geometric features on compact semantic embeddings, enabling efficient and stable 6D pose estimation. To enhance robustness and generalization, the model is trained on large-scale 6D pose…
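
As a rough illustration of what "SO(3)-aware" can mean in contrast to geometry-agnostic regression, the sketch below interpolates between two rotations along the geodesic of SO(3), so every intermediate value is itself a valid rotation. This is a generic textbook construction and not the paper's reflected flow matching predictor, whose details the abstract does not give. All function names (`so3_exp`, `so3_log`, `geodesic_interp`, `geodesic_angle`) are hypothetical and chosen for this example only.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def so3_exp(w):
    """Rodrigues formula: axis-angle vector w -> rotation matrix."""
    theta = math.sqrt(sum(x * x for x in w))
    K = [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]
    if theta < 1e-12:
        a, b = 1.0, 0.5  # small-angle limits of the two coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    K2 = matmul(K, K)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def so3_log(R):
    """Rotation matrix -> axis-angle vector (assumes angle below pi)."""
    c = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(c)
    if theta < 1e-12:
        return [0.0, 0.0, 0.0]
    s = theta / (2.0 * math.sin(theta))
    return [s * (R[2][1] - R[1][2]),
            s * (R[0][2] - R[2][0]),
            s * (R[1][0] - R[0][1])]

def geodesic_interp(R0, R1, s):
    """Point at fraction s in [0, 1] on the geodesic from R0 to R1."""
    rel = so3_log(matmul(transpose(R0), R1))
    return matmul(R0, so3_exp([s * x for x in rel]))

def geodesic_angle(R0, R1):
    """Rotation angle (radians) separating R0 and R1."""
    w = so3_log(matmul(transpose(R0), R1))
    return math.sqrt(sum(x * x for x in w))

# Example: halfway between identity and a 90-degree turn about z.
R0 = so3_exp([0.0, 0.0, 0.0])
R1 = so3_exp([0.0, 0.0, math.pi / 2])
R_mid = geodesic_interp(R0, R1, 0.5)
mid_angle = geodesic_angle(R0, R_mid)
```

Interpolating matrix entries linearly would leave the rotation group at intermediate points, whereas the geodesic path stays on it by construction. The logarithm used here is not numerically reliable for rotations close to 180 degrees, where a practical implementation would need a separate branch.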
