Transcrib3D: 3D Referring Expression Resolution through Large Language Models

Jiading Fang; Xiangshan Tan; Shengjie Lin; Igor Vasiljevic; Vitor Guizilini; Hongyuan Mei; Rares Ambrus; Gregory Shakhnarovich; Matthew R. Walter

arXiv:2404.19221·cs.CV·May 1, 2024

TL;DR

Transcrib3D leverages large language models and 3D detection to interpret natural language references in 3D environments, achieving state-of-the-art results and enabling robots to perform complex tasks.

Contribution

The paper introduces an approach that combines 3D detection with LLMs, using text as the unifying medium, which reduces the need for extensive annotated 3D data.
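
To make "text as the unifying medium" concrete, here is a minimal sketch of how detector output might be written out as plain text and placed in an LLM prompt. It is an illustration under stated assumptions, not the paper's implementation: the `DetectedObject` fields, the line format, the prompt wording, and the function names `transcribe_scene` and `build_prompt` are all hypothetical, and no LLM is actually called.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    """Hypothetical detector output: one labeled 3D bounding box."""
    obj_id: int
    label: str
    center: Tuple[float, float, float]  # assumed: metres, scene frame
    size: Tuple[float, float, float]    # assumed: box extents in metres


def transcribe_scene(objects: List[DetectedObject]) -> str:
    """Render each detection as one line of plain text."""
    lines = []
    for o in objects:
        cx, cy, cz = o.center
        sx, sy, sz = o.size
        lines.append(
            f"obj {o.obj_id}: {o.label}; "
            f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}); "
            f"size=({sx:.2f}, {sy:.2f}, {sz:.2f})"
        )
    return "\n".join(lines)


def build_prompt(objects: List[DetectedObject], expression: str) -> str:
    """Combine the scene transcript and the referring expression (assumed wording)."""
    return (
        "Objects detected in the scene:\n"
        f"{transcribe_scene(objects)}\n\n"
        f"Referring expression: {expression}\n"
        "Answer with the id of the referred object."
    )


if __name__ == "__main__":
    scene = [
        DetectedObject(0, "chair", (1.0, 2.0, 0.45), (0.5, 0.5, 0.9)),
        DetectedObject(1, "chair", (3.2, 2.1, 0.45), (0.5, 0.5, 0.9)),
        DetectedObject(2, "window", (3.5, 4.0, 1.5), (1.2, 0.1, 1.0)),
    ]
    print(build_prompt(scene, "the chair closest to the window"))
```

Because the scene reaches the language model as ordinary text, no learned joint 3D-language embedding is needed in this sketch; the spatial reasoning over coordinates is left to the model.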

Findings

01 Achieves state-of-the-art 3D reference resolution performance.

02 Enables real robot pick-and-place tasks with challenging referring expressions.

03 Proposes self-correction fine-tuning for improved zero-shot performance.

Abstract

If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging -- it requires the ability to both parse the 3D structure of the scene and correctly ground free-form language in the presence of distraction and clutter. We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models (LLMs). Transcrib3D uses text as the unifying medium, which allows us to sidestep the need to learn shared representations connecting multi-modal inputs, which would require massive amounts of annotated 3D data. As a demonstration of its effectiveness, Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks, with a great leap in performance from previous…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis