Open-vocabulary object 6D pose estimation
Jaime Corsetti, Davide Boscaini, Changjae Oh, Andrea Cavallaro, Fabio Poiesi

TL;DR
This paper proposes a novel open-vocabulary 6D object pose estimation method that uses textual prompts and vision-language models, eliminating the need for object models at inference and generalizing to new objects.
Contribution
It introduces a new setting for 6D pose estimation using only textual prompts and develops a fusion strategy with vision-language models to handle novel objects.
Findings
Outperforms existing methods on the new benchmark datasets.
Effectively generalizes to unseen object categories.
Achieves accurate pose estimation without object-specific models.
Abstract
We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g., CAD or video sequence) is required at inference, and (iii) the object is imaged from two RGBD viewpoints of different scenes. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from the scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275…
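Illustration (not from the paper)
The abstract says the method estimates the object's relative 6D pose between two RGBD views but, as excerpted here, does not say how the pose is computed. As a minimal sketch only: assuming 3D point correspondences on the object have already been found between the two views, one standard way to recover a rigid rotation and translation is a least-squares alignment (the Kabsch algorithm). The function name `relative_pose_kabsch` and the synthetic data below are hypothetical and for illustration; this is not presented as the authors' pipeline.

```python
import numpy as np

def relative_pose_kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    src, dst: (N, 3) arrays of matched 3D points (assumed correspondences).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that det(R) = +1
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic check: rotate and translate random points, then recover the pose.
rng = np.random.default_rng(0)
points_a = rng.normal(size=(50, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.5])
points_b = points_a @ R_true.T + t_true
R_est, t_est = relative_pose_kabsch(points_a, points_b)
```

With noise-free correspondences the estimate should match the true pose up to numerical precision. Real matches contain outliers, so a robust wrapper such as RANSAC around this step would typically be needed.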
Taxonomy
Topics: Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage · Image Processing and 3D Reconstruction
