Transcrib3D: 3D Referring Expression Resolution through Large Language Models
Jiading Fang, Xiangshan Tan, Shengjie Lin, Igor Vasiljevic, Vitor Guizilini, Hongyuan Mei, Rares Ambrus, Gregory Shakhnarovich, Matthew R. Walter

TL;DR
Transcrib3D leverages large language models and 3D detection to interpret natural language references in 3D environments, achieving state-of-the-art results and enabling robots to perform complex tasks.
Contribution
It introduces a novel approach combining 3D detection with LLMs using text as the unifying medium, reducing the need for extensive annotated 3D data.
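As a rough illustration of what "text as the unifying medium" could look like in practice, the sketch below turns a list of 3D detections into plain-text lines and assembles a prompt for an LLM. The `DetectedObject` fields, the line format, and the prompt wording are all assumptions made for this example and are not taken from the paper; no LLM call is made.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    """One 3D detection. Field names are hypothetical, not the paper's schema."""
    object_id: int
    label: str
    center: Tuple[float, float, float]  # assumed: metres in the scene frame
    size: Tuple[float, float, float]    # assumed: bounding-box extents in metres


def transcribe_scene(objects: List[DetectedObject]) -> str:
    """Render each detection as one line of text (an assumed format)."""
    lines = []
    for obj in objects:
        cx, cy, cz = obj.center
        sx, sy, sz = obj.size
        lines.append(
            f"obj{obj.object_id}: {obj.label}, "
            f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}), "
            f"size=({sx:.2f}, {sy:.2f}, {sz:.2f})"
        )
    return "\n".join(lines)


def build_prompt(objects: List[DetectedObject], expression: str) -> str:
    """Combine the scene transcript and the referring expression into one prompt."""
    return (
        "Objects detected in the scene:\n"
        f"{transcribe_scene(objects)}\n\n"
        f"Referring expression: {expression}\n"
        "Answer with the id of the object being referred to."
    )


# Toy scene with made-up values, for illustration only.
scene = [
    DetectedObject(0, "chair", (1.0, 2.0, 0.45), (0.5, 0.5, 0.9)),
    DetectedObject(1, "chair", (3.0, 0.5, 0.45), (0.5, 0.5, 0.9)),
    DetectedObject(2, "table", (1.2, 2.6, 0.4), (1.2, 0.8, 0.8)),
]
prompt = build_prompt(scene, "the chair closest to the table")
print(prompt)
```

The resulting string would be sent to whichever LLM is in use. Because the scene reaches the model only as text, no learned joint 3D-language representation is required, which is the point the contribution statement makes.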
Findings
Achieves state-of-the-art 3D reference resolution performance.
Enables real robot pick-and-place tasks with challenging referring expressions.
Proposes self-correction fine-tuning for improved zero-shot performance.
Abstract
If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging -- it requires the ability to both parse the 3D structure of the scene and correctly ground free-form language in the presence of distraction and clutter. We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models (LLMs). Transcrib3D uses text as the unifying medium, which allows us to sidestep the need to learn shared representations connecting multi-modal inputs, which would require massive amounts of annotated 3D data. As a demonstration of its effectiveness, Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks, with a great leap in performance from previous…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
