DeMo-Pose: Depth-Monocular Modality Fusion For Object Pose Estimation
Rachit Agarwal, Abhishek Joshi, Sathish Chalasani, Woo Jin Kim

TL;DR
DeMo-Pose is a hybrid RGB-D architecture that fuses semantic and geometric features for improved real-time object pose estimation without CAD models.
Contribution
It introduces a novel multimodal fusion strategy and Mesh-Point Loss for enhanced geometric reasoning in category-level 3D pose estimation.
Findings
Outperforms state-of-the-art methods by 3.2% on 3D IoU
Achieves 11.1% improvement in pose accuracy on REAL275
Enables real-time inference with improved robustness
Abstract
Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multimodal fusion strategy. To further improve geometric reasoning, we introduce a novel Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and…
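To make the "9-DoF" output concrete, the sketch below shows one common way to turn a rotation, a translation, and a 3D size into the eight corners of an oriented bounding box, which is the kind of box a 3D IoU metric compares. This is a generic illustration and not code from the paper: the function name `pose_to_box_corners`, the corner ordering, and the convention that size means full extent along each object axis are all assumptions made here.

```python
import numpy as np

def pose_to_box_corners(R, t, size):
    """Return the 8 corners (shape 8x3) of an oriented 3D box.

    R    : (3, 3) rotation matrix (3 DoF)
    t    : (3,)   translation of the box centre (3 DoF)
    size : (3,)   full extent along the object's x, y, z axes (3 DoF)

    Hypothetical helper for illustration; the paper's own
    parameterization may differ.
    """
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    size = np.asarray(size, dtype=float)
    # Corners of a unit cube centred at the origin: every combination of +/-0.5.
    signs = np.array([[x, y, z]
                      for x in (-0.5, 0.5)
                      for y in (-0.5, 0.5)
                      for z in (-0.5, 0.5)])
    local = signs * size          # scale to the object's size
    return local @ R.T + t        # rotate each corner, then translate

# Usage: a 2 x 4 x 6 box centred at (1, 2, 3) with no rotation.
corners = pose_to_box_corners(np.eye(3), [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Here the six pose parameters (rotation and translation) place the box in the camera frame, and the three size parameters scale it, which is why category-level methods that lack a CAD model must predict size alongside pose.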
