Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation

Joel Alberto Santos; Zongwei Wu; Xavier Alameda-Pineda; Radu Timofte

arXiv:2511.22025 · cs.CV · December 1, 2025

Open Access

TL;DR

This paper investigates direct audio-visual alignment for object grounding, without text as an intermediate representation. It introduces a new audio-based grounding dataset and benchmarks models on it, showing that direct audio grounding can be more robust and efficient than transcription-based methods.

Contribution

The paper introduces a new audio-based grounding dataset and benchmarks models on it, demonstrating that direct audio-visual grounding is feasible and has advantages over traditional transcription pipelines.

Findings

01

Direct audio grounding is feasible and sometimes superior to transcription-based methods.

02

Models trained on audio can be more robust to linguistic variability.

03

The new dataset covers diverse objects and accents, supporting broader research.

Abstract

Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of…
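The contrast the abstract draws between the two routes can be illustrated with a toy sketch. This is not the authors' model: the function names, the embedding values, the threshold, and the use of cosine similarity over a grid of patch embeddings are all assumptions made for illustration. Because the task is restricted to single-word instructions, the keyword-extraction step of the transcription pipeline is omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ground_direct(query_emb, patch_embs, threshold=0.5):
    """Direct route (hypothetical): compare a query embedding with each patch.

    Returns a binary mask with the same grid shape as patch_embs, marking
    patches whose similarity to the query exceeds the threshold.
    """
    return [[1 if cosine(query_emb, p) > threshold else 0 for p in row]
            for row in patch_embs]

def ground_via_text(audio, transcribe, text_encoder, patch_embs, threshold=0.5):
    """Transcription route (hypothetical): speech -> word -> embedding -> mask.

    `transcribe` and `text_encoder` are caller-supplied stand-ins for a speech
    recognizer and a text encoder; any transcription error propagates to the mask.
    """
    word = transcribe(audio)
    return ground_direct(text_encoder(word), patch_embs, threshold)

# Toy 2x2 grid of patch embeddings (made-up values, not real features).
patches = [[[1.0, 0.0], [0.9, 0.1]],
           [[0.0, 1.0], [0.1, 0.9]]]

# Stand-in for the output of an audio encoder applied to one spoken word.
audio_embedding = [1.0, 0.0]
direct_mask = ground_direct(audio_embedding, patches)

# A recognizer that mishears the word sends the text route to the wrong region.
misheard_mask = ground_via_text(
    audio="<waveform>",
    transcribe=lambda a: "cup",
    text_encoder={"cup": [0.0, 1.0]}.get,
    patch_embs=patches,
)
```

In this toy setup the direct route has a single point of failure (the audio encoder), while the transcription route depends on both the recognizer and the text encoder. That is the efficiency and robustness question the abstract raises.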

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI