OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image

Tessa Pulli; Jean-Baptiste Weibel; Peter Hönig; Matthias Hirschmanner; Markus Vincze; Andreas Holzinger

arXiv:2601.07333·cs.CV·January 13, 2026

TL;DR

OSCAR is a training-free, open-set 3D object retrieval method that uses a language prompt and a single image to retrieve the CAD model needed for 6D pose estimation. It outperforms state-of-the-art methods on the MI3DOR benchmark.

Contribution

The paper introduces a training-free approach that combines multi-modal embeddings with multi-view renderings to retrieve CAD models in an open-set setting from a single image and a language prompt.

Findings

01 Outperforms state-of-the-art on MI3DOR benchmark

02 Achieves 90.48% average precision on YCB-V dataset

03 Enables effective 6D pose estimation using retrieved models

Abstract

6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support ever-changing object sets in such contexts, modern zero-shot object pose estimators were developed that require no object-specific training and rely only on CAD models. Such models are hard to obtain once a system is deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects…
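
As an illustration of how retrieval over multi-view renderings can work, here is a minimal sketch in Python. It is not the authors' implementation: the embedding vectors are hand-written toy values standing in for features from an image encoder, and the names `cosine`, `rank_models`, `database`, and `query` are hypothetical. Scoring each model by its best-matching view (maximum cosine similarity) is an assumption made for the example; the abstract as shown here does not state how OSCAR aggregates views, and the captioning and language-prompt stage is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_models(query_emb, database):
    """Rank database models by their best-matching rendered view.

    database maps a model id to a list of per-view embeddings.
    Taking the maximum over views is an assumption of this sketch.
    """
    scored = []
    for model_id, views in database.items():
        best = max(cosine(query_emb, view) for view in views)
        scored.append((model_id, best))
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy database: three models, two rendered views each (made-up 3-D embeddings).
database = {
    "mug": [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1]],
    "drill": [[0.1, 0.2, 0.9], [0.5, 0.5, 0.5]],
    "bowl": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.6]],
}

# Made-up embedding of the detected object crop from the query image.
query = [1.0, 0.0, 0.0]

ranking = rank_models(query, database)
for model_id, score in ranking:
    print(f"{model_id}: {score:.3f}")
```

Using the maximum over views means a model is retrieved as soon as one rendering resembles the query viewpoint. Averaging over views would be an alternative, at the cost of penalizing models whose other views look different from the query.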

Taxonomy

Topics: Robot Manipulation and Learning · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization