I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs

Yu Qi; Lipeng Gu; Honghua Chen; Liangliang Nan; and Mingqiang Wei

arXiv:2506.14495·cs.CV·June 18, 2025

TL;DR

SpeechRefer is a novel 3D visual grounding framework that robustly handles noisy and ambiguous speech inputs by integrating acoustic and contrastive modules, significantly improving performance over existing methods.

Contribution

SpeechRefer introduces two modules that reduce reliance on accurate transcriptions, enabling 3D visual grounding from imperfect speech-to-text output.

Findings

01. Significant performance improvements on the SpeechRefer and SpeechNr3D datasets.

02. Robustness to transcription errors demonstrated in experiments.

03. Enhanced multimodal system capabilities with noisy speech inputs.

Abstract

Existing 3D visual grounding (3DVG) methods rely on precise text prompts to locate objects within 3D scenes. Speech, as a natural and intuitive modality, offers a promising alternative. Real-world speech inputs, however, often suffer from transcription errors due to accents, background noise, and varying speech rates, limiting the applicability of existing 3DVG methods. To address these challenges, we propose SpeechRefer, a novel 3DVG framework designed to enhance performance in the presence of noisy and ambiguous speech-to-text transcriptions. SpeechRefer integrates seamlessly with existing 3DVG models and introduces two key innovations. First, the Speech Complementary Module captures acoustic similarities between phonetically related words and highlights subtle distinctions, generating complementary proposal scores from the speech signal. This reduces dependence on potentially…
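The abstract says the Speech Complementary Module produces complementary proposal scores from the speech signal, but the visible text does not say how those scores are combined with the scores of the underlying 3DVG model. The sketch below is only an illustration of the general idea, assuming a simple weighted sum of softmax-normalized scores. The function names, the `alpha` weight, and the fusion rule are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_proposal_scores(text_scores, speech_scores, alpha=0.5):
    """Hypothetical fusion rule (an assumption, not the paper's method):
    a weighted sum of per-proposal scores from the transcribed text and
    from the raw speech signal. alpha=0 ignores speech entirely."""
    if len(text_scores) != len(speech_scores):
        raise ValueError("both score lists must cover the same proposals")
    text_probs = softmax(text_scores)
    speech_probs = softmax(speech_scores)
    return [
        (1 - alpha) * t + alpha * s
        for t, s in zip(text_probs, speech_probs)
    ]


def select_proposal(text_scores, speech_scores, alpha=0.5):
    """Return the index of the highest-scoring object proposal."""
    fused = fuse_proposal_scores(text_scores, speech_scores, alpha)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores for three object proposals. A transcription error leaves
# the text branch nearly undecided between proposals 0 and 1, while the
# speech branch clearly favors proposal 1.
text_scores = [2.0, 1.9, 0.1]
speech_scores = [0.2, 2.5, 0.1]

text_only_choice = select_proposal(text_scores, speech_scores, alpha=0.0)
fused_choice = select_proposal(text_scores, speech_scores, alpha=0.5)
```

With these made-up numbers, the text-only setting picks proposal 0, while the fused setting picks proposal 1, which is the kind of correction the complementary speech scores are meant to provide when the transcription is unreliable.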

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning · ALIGN