ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam

TL;DR
ConceptPose introduces a training-free, zero-shot object pose estimation method that leverages vision-language models to create open-vocabulary 3D concept maps for accurate 6DoF pose estimation.
Contribution
ConceptPose is a framework that uses vision-language models to estimate object pose without any training on specific datasets or objects.
Findings
Achieves state-of-the-art zero-shot relative pose estimation results.
Outperforms the strongest baseline by a relative 62% in average ADD(-S) score.
Operates without any object or dataset-specific training.
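For readers unfamiliar with the ADD(-S) score mentioned above: ADD is the mean distance between model points transformed by the estimated pose and by the ground-truth pose, and ADD-S replaces the point-to-point distance with a nearest-point distance so that symmetric objects are not penalized. The following is a minimal NumPy sketch of these standard definitions. The function names are illustrative and are not taken from the paper, whose evaluation code is not shown here.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) array of points."""
    return points @ R.T + t

def add_metric(points, R_est, t_est, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed model points."""
    diff = transform(points, R_est, t_est) - transform(points, R_gt, t_gt)
    return float(np.linalg.norm(diff, axis=1).mean())

def add_s_metric(points, R_est, t_est, R_gt, t_gt):
    """ADD-S: for each ground-truth point, distance to the nearest estimated point."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    # Dense (N, N) pairwise distances; adequate for small point sets only.
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())
```

For an object with a four-fold symmetry, a 90 degree rotation error yields a large ADD but an ADD-S near zero, which is why the symmetric variant is used for such objects.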
Abstract
Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, outperforming the strongest baseline by a relative 62% in…
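The step from 3D-3D correspondences to a 6DoF relative pose can be illustrated with a generic sketch. The code below assumes each 3D point carries a feature vector (standing in for a concept vector), pairs points by mutual nearest neighbor under cosine similarity, and then solves for the rigid transform with the Kabsch (orthogonal Procrustes) algorithm. This is an assumption-laden illustration, not the authors' pipeline: the abstract does not specify how correspondences are matched or filtered, and the function names `match_by_cosine` and `kabsch` are hypothetical.

```python
import numpy as np

def match_by_cosine(feat_a, feat_b):
    """Mutual nearest-neighbor matching of (N, D) and (M, D) feature arrays.

    Hypothetical matcher for illustration; the paper's matching step may differ.
    Returns index arrays (idx_a, idx_b) of matched pairs.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T                      # cosine similarity, shape (N, M)
    nn_ab = sim.argmax(axis=1)         # best match in b for each point in a
    nn_ba = sim.argmax(axis=0)         # best match in a for each point in b
    idx_a = np.arange(len(a))
    keep = nn_ba[nn_ab] == idx_a       # keep only mutually consistent pairs
    return idx_a[keep], nn_ab[keep]

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

A typical use would be `ia, ib = match_by_cosine(feats_a, feats_b)` followed by `R, t = kabsch(points_a[ia], points_b[ib])`. With noisy real correspondences, a robust wrapper such as RANSAC around `kabsch` would normally be needed; that is omitted here for brevity.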
