VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Md Selim Sarowar, Sungho Kim

TL;DR
This paper compares CLIP and DINOv2 vision models for 3D pose estimation in robotic grasping, highlighting their complementary strengths in semantics and geometry through extensive benchmark experiments.
Contribution
It provides a comprehensive analysis of how different vision foundation models perform in 3D pose estimation, guiding model selection for robotic manipulation tasks.
Findings
CLIP excels in semantic understanding and language grounding.
DINOv2 offers superior dense geometric features.
Both models show competitive performance with specific strengths.
Abstract
Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP based and DINOv2 based approaches for 3D pose estimation in hand object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP based methods achieve better semantic consistency, while DINOv2 based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation and grasping, picking…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRobot Manipulation and Learning · Human Pose and Action Recognition · Hand Gesture Recognition Systems
