DeMo-Pose: Depth-Monocular Modality Fusion for Object Pose Estimation

Rachit Agarwal; Abhishek Joshi; Sathish Chalasani; Woo Jin Kim

arXiv:2603.27533 · cs.CV · March 31, 2026


TL;DR

DeMo-Pose is a hybrid RGB-D architecture that fuses semantic and geometric features for improved real-time object pose estimation without CAD models.

Contribution

It introduces a novel multimodal fusion strategy and Mesh-Point Loss for enhanced geometric reasoning in category-level 3D pose estimation.

Findings

01 Outperforms state-of-the-art methods by 3.2% on 3D IoU

02 Achieves 11.1% improvement in pose accuracy on REAL275

03 Enables real-time inference with improved robustness
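
For readers unfamiliar with the 3D IoU metric named in the findings, the following is a minimal sketch of intersection-over-union for two 3D boxes. It assumes axis-aligned boxes given by a center and a size, which is a simplification: pose benchmarks generally compare oriented boxes, and the evaluation protocol used by the paper is not described on this page. The function name `aabb_iou_3d` is hypothetical.

```python
import numpy as np

def aabb_iou_3d(center_a, size_a, center_b, size_b):
    """IoU of two axis-aligned 3D boxes, each given as (center, size).

    Simplifying assumption: boxes are NOT rotated. This illustrates the
    metric only; it is not the paper's evaluation code.
    """
    ca, sa = np.asarray(center_a, dtype=float), np.asarray(size_a, dtype=float)
    cb, sb = np.asarray(center_b, dtype=float), np.asarray(size_b, dtype=float)

    # Min and max corners of each box
    min_a, max_a = ca - sa / 2, ca + sa / 2
    min_b, max_b = cb - sb / 2, cb + sb / 2

    # Overlap length along each axis, clamped at zero when the boxes are disjoint
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)

    intersection = overlap.prod()
    union = sa.prod() + sb.prod() - intersection
    return float(intersection / union)
```

Two unit cubes offset by half a side along one axis overlap in half their volume, so `aabb_iou_3d([0, 0, 0], [1, 1, 1], [0.5, 0, 0], [1, 1, 1])` gives 0.5 / 1.5, roughly 0.33.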

Abstract

Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multimodal fusion strategy. To further improve geometric reasoning, we introduce a novel Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and…
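
To make the "9-DoF" terminology in the abstract concrete, here is a minimal sketch of how a rotation (3 DoF), a translation (3 DoF), and a 3D size (3 DoF) together define an oriented bounding box. The function name `pose9dof_to_box_corners` and the rotation-matrix parameterization are assumptions for illustration; the abstract does not specify how DeMo-Pose represents its outputs.

```python
import numpy as np

def pose9dof_to_box_corners(R, t, size):
    """Return the 8 corners (shape 8x3) of the box described by a 9-DoF pose.

    R:    3x3 rotation matrix (3 DoF), assumed to be a valid rotation
    t:    translation of the box center (3 DoF)
    size: box extents along the object's own x, y, z axes (3 DoF)
    """
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    size = np.asarray(size, dtype=float)

    # Corners of a unit cube centered at the origin, in the object frame
    signs = np.array([[sx, sy, sz]
                      for sx in (-0.5, 0.5)
                      for sy in (-0.5, 0.5)
                      for sz in (-0.5, 0.5)])

    corners_obj = signs * size       # scale the unit cube to the object size
    return corners_obj @ R.T + t     # rotate, then translate into the camera frame
```

With an identity rotation, `t = (1, 2, 3)`, and `size = (2, 4, 6)`, the corners span from `(0, 0, 0)` to `(2, 4, 6)`.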
