Towards Two-view 6D Object Pose Estimation: A Comparative Study on Fusion Strategy
Jun Wu, Lilu Liu, Yue Wang, Rong Xiong

TL;DR
This paper investigates fusion strategies for two-view, RGB-based 6D object pose estimation. It proposes a framework that learns implicit 3D information from two RGB images and shows that mid-fusion yields the best results, narrowing the gap with RGBD-based methods.
Contribution
The paper introduces a framework that learns implicit 3D features from two RGB images and compares three fusion strategies, identifying mid-fusion as the most effective for pose estimation.
Findings
The mid-fusion approach restores 3D keypoints best.
The proposed method outperforms existing RGB-based methods.
It achieves results comparable to those of RGBD-based methods.
Abstract
Current RGB-based 6D object pose estimation methods have achieved notable performance on datasets and in real-world applications. However, predicting 6D pose from the features of a single 2D image is susceptible to disturbance from environmental changes and from textureless or similar-looking object surfaces. Hence, RGB-based methods generally achieve less competitive results than RGBD-based methods, which exploit both image features and 3D structure features. To narrow this performance gap, this paper proposes a framework for 6D object pose estimation that learns implicit 3D information from two RGB images. Combining the learned 3D information with 2D image features, we establish a more stable correspondence between the scene and the object models. To find the method that best utilizes 3D information from RGB inputs, we investigate three different approaches, including Early-Fusion,…
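As a rough illustration of how 3D keypoint correspondences between an object model and the scene can be turned into a 6D pose, the sketch below uses the standard least-squares rigid fit (Kabsch algorithm). This is an assumption for illustration only: the summary above does not say which fitting procedure the paper uses, and the function name fit_rigid_transform and the synthetic keypoints are hypothetical.

```python
import numpy as np

def fit_rigid_transform(model_pts, scene_pts):
    """Least-squares rotation R and translation t with scene ~= R @ model + t.

    Standard Kabsch fit; not taken from the paper (illustrative assumption).
    """
    model_pts = np.asarray(model_pts, dtype=float)
    scene_pts = np.asarray(scene_pts, dtype=float)
    # Center both point sets on their centroids.
    model_c = model_pts.mean(axis=0)
    scene_c = scene_pts.mean(axis=0)
    # 3x3 cross-covariance between centered model and scene keypoints.
    H = (model_pts - model_c).T @ (scene_pts - scene_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = scene_c - R @ model_c
    return R, t

# Synthetic check: hypothetical model keypoints moved by a known pose.
rng = np.random.default_rng(0)
model = rng.normal(size=(8, 3))
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.1, -0.2, 0.5])
scene = model @ R_true.T + t_true

R_est, t_est = fit_rigid_transform(model, scene)
```

With noise-free correspondences the fit recovers the pose that generated the scene keypoints; with noisy keypoints it returns the least-squares estimate, which is why the accuracy of the restored 3D keypoints matters for the final pose.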
Taxonomy
Topics: Advanced Neural Network Applications · Robot Manipulation and Learning · Human Pose and Action Recognition
