VLM6D: VLM based 6Dof Pose Estimation based on RGB-D Images
Md Selim Sarowar, Sungho Kim

TL;DR
VLM6D introduces a dual-stream RGB-D architecture combining Vision Transformer and PointNet++ encoders for robust 6D object pose estimation, excelling under challenging real-world conditions.
Contribution
The paper presents a novel dual-stream architecture that fuses visual and geometric features for improved 6D pose estimation from RGB-D data, enhancing robustness and accuracy.
Findings
Achieved new state-of-the-art results on Occluded-LineMOD.
Demonstrated robustness against textureless, occluded, and lighting-variable scenarios.
Validated superior performance over existing methods.
Abstract
A central challenge in computer vision is precisely estimating the 6D pose of objects; however, many current approaches remain fragile and struggle to generalize from synthetic data to real-world situations with fluctuating lighting, textureless objects, and significant occlusions. To address these limitations, we propose VLM6D, a novel dual-stream architecture that leverages the distinct strengths of visual and geometric data from RGB-D input for robust and precise pose estimation. Our framework uniquely integrates two specialized encoders: a powerful, self-supervised Vision Transformer (DINOv2) processes the RGB modality, harnessing its rich, pre-trained understanding of visual grammar to achieve remarkable resilience against texture and lighting variations. Concurrently, a PointNet++ encoder processes the 3D point cloud derived from depth data, enabling robust geometric reasoning that…
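The abstract says the geometric stream consumes a 3D point cloud derived from depth data, but does not say how the authors compute it. The sketch below shows the standard pinhole back-projection often used for that step; it is an illustration, not the paper's code. The function name `depth_to_point_cloud` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are hypothetical, and depth is assumed to be in metres with non-positive values marking missing pixels.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels. Pixels whose depth is non-positive or
    non-finite are treated as missing and dropped.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    # Points are returned in row-major pixel order, in the camera frame.
    return np.stack([x, y, z], axis=1)


if __name__ == "__main__":
    toy_depth = np.array([[1.0, 0.0],
                          [2.0, 2.0]])
    points = depth_to_point_cloud(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
    print(points.shape)
```

The resulting array is in the camera coordinate frame. A point-based encoder such as PointNet++ typically expects a fixed number of points, so in practice the cloud would usually be cropped to the object region and subsampled before encoding; those steps are omitted here.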
Taxonomy
Topics: Robot Manipulation and Learning · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis
