VLM6D: VLM based 6Dof Pose Estimation based on RGB-D Images

Md Selim Sarowar; Sungho Kim

arXiv:2511.00120·cs.CV·November 4, 2025

TL;DR

VLM6D introduces a dual-stream RGB-D architecture combining Vision Transformer and PointNet++ encoders for robust 6D object pose estimation, excelling under challenging real-world conditions.

Contribution

The paper presents a novel dual-stream architecture that fuses visual and geometric features for improved 6D pose estimation from RGB-D data, enhancing robustness and accuracy.

Findings

01. Achieved new state-of-the-art results on Occluded-LineMOD.

02. Demonstrated robustness against textureless, occluded, and lighting-variable scenarios.

03. Validated superior performance over existing methods.

Abstract

A primary challenge in computer vision is precisely estimating the 6D pose of objects; however, many current approaches remain fragile and have trouble generalizing from synthetic data to real-world situations with fluctuating lighting, textureless objects, and significant occlusions. To address these limitations, we propose VLM6D, a novel dual-stream architecture that leverages the distinct strengths of visual and geometric data from RGB-D input for robust and precise pose estimation. Our framework integrates two specialized encoders: a self-supervised Vision Transformer (DINOv2) processes the RGB modality, harnessing its rich, pre-trained understanding of visual grammar to achieve resilience against texture and lighting variations. Concurrently, a PointNet++ encoder processes the 3D point cloud derived from depth data, enabling robust geometric reasoning that…
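The abstract mentions a 3D point cloud derived from depth data but, as excerpted here, does not say how that derivation is done. A common way to do it is pinhole back-projection, sketched below as an illustration only; it is not taken from the paper. The function name `depth_to_point_cloud`, the intrinsics `fx`, `fy`, `cx`, `cy`, the toy depth values, and the convention that a depth of zero marks a missing measurement are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points.

    Assumes a pinhole camera: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. Pixels with depth <= 0 are treated
    as missing and skipped (an assumed convention, not from the paper).
    """
    points: List[Point] = []
    for v, row in enumerate(depth):      # v: pixel row index
        for u, z in enumerate(row):      # u: pixel column index
            if z <= 0:
                continue
            x = (u - cx) * z / fx        # invert u = fx * x / z + cx
            y = (v - cy) * z / fy        # invert v = fy * y / z + cy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with made-up intrinsics; one pixel has no measurement.
toy_depth = [[1.0, 0.0],
             [2.0, 2.0]]
points = depth_to_point_cloud(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In a dual-stream setup like the one described, a point list of this kind would be what the geometric encoder consumes, while the RGB image goes to the visual encoder; any sampling or normalization applied before encoding is not specified in this excerpt.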

Taxonomy

Topics: Robot Manipulation and Learning · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis