OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image
Tessa Pulli, Jean-Baptiste Weibel, Peter Hönig, Matthias Hirschmanner, Markus Vincze, Andreas Holzinger

TL;DR
OSCAR is a training-free, open-set 3D object retrieval method that uses a language prompt and a single image to identify and source CAD models for 6D pose estimation, outperforming state-of-the-art methods on the MI3DOR benchmark.
Contribution
It introduces a novel, training-free approach combining multi-modal embeddings and multi-view renderings for open-set CAD model retrieval from a single image and language prompt.
Findings
Outperforms the state of the art on the MI3DOR benchmark
Achieves 90.48% average precision on the YCB-V dataset
Enables effective 6D pose estimation using retrieved models
Abstract
6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support ever-changing object sets in such contexts, modern zero-shot object pose estimators were developed that require no object-specific training and rely only on CAD models. Such models are hard to obtain once a system is deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects…
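The abstract is cut off before the matching step is described, so the following is only a minimal sketch of how retrieval over multi-view renderings could work, not the authors' procedure. It assumes each database model has already been onboarded as a list of per-view embedding vectors and that the query (image crop, prompt, or both) has been embedded into the same space. The function names `cosine_similarity` and `rank_models`, the max-over-views scoring rule, and the toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_models(
    query: Sequence[float],
    database: Dict[str, List[Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Rank database models by their best-matching rendered view.

    Assumption: a model's score is the maximum cosine similarity between
    the query embedding and any of its per-view embeddings.
    """
    scores = []
    for model_id, view_embeddings in database.items():
        best = max(cosine_similarity(query, v) for v in view_embeddings)
        scores.append((model_id, best))
    # Highest score first.
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy stand-ins for embeddings of rendered views (hypothetical values).
database = {
    "mug": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],
    "drill": [[0.0, 1.0, 0.0], [0.0, 0.6, 0.8]],
}
query = [0.9, 0.1, 0.0]

ranking = rank_models(query, database)
print(ranking[0][0])
```

Taking the maximum over views means a model is retrieved if any one rendering resembles the query, which suits a single-image query taken from an unknown viewpoint. Averaging over views would be an alternative design choice.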
Taxonomy
Topics: Robot Manipulation and Learning · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization
