Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation
Joel Alberto Santos, Zongwei Wu, Xavier Alameda-Pineda, Radu Timofte

TL;DR
This paper investigates direct audio-visual alignment for object grounding. It proposes a new audio-based grounding dataset and benchmarks models, finding that direct audio grounding can be more robust and efficient than transcription-based methods.
Contribution
It introduces a novel audio-based grounding dataset and benchmarks models, demonstrating the feasibility and advantages of direct audio-visual grounding over traditional transcription pipelines.
Findings
Direct audio grounding is feasible and sometimes superior to transcription-based methods.
Models trained on audio can be more robust to linguistic variability.
The new dataset covers diverse objects and accents, supporting broader research.
Abstract
Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of…
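To make the contrast in the abstract concrete, the following is a minimal toy sketch of the two routes: the transcription-based "layover" (speech to text, keyword extraction, then grounding) and the "direct flight" (matching an audio embedding against visual region embeddings). Everything here is an assumption made for illustration and not the authors' method: the `REGIONS` data, the function names `ground_via_text` and `ground_direct`, the hand-written three-dimensional embeddings, and the use of bounding boxes rather than segmentation masks. A real system would obtain transcripts and embeddings from trained speech and vision models.

```python
import math

# Hypothetical image regions. "label" is used only by the text route,
# "embedding" only by the direct route. All values are made up.
REGIONS = [
    {"box": (10, 20, 110, 140), "label": "cup", "embedding": [0.9, 0.1, 0.0]},
    {"box": (150, 30, 300, 200), "label": "dog", "embedding": [0.1, 0.9, 0.2]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ground_via_text(transcript, regions):
    """Layover route: match a keyword from the transcript to a region label.

    The transcript is assumed to come from a separate speech recognizer,
    which is not modeled here. Returns None when no label matches.
    """
    words = transcript.lower().split()
    for region in regions:
        if region["label"] in words:
            return region["box"]
    return None


def ground_direct(audio_embedding, regions):
    """Direct route: pick the region whose embedding is most similar
    to the audio embedding, with no text in between."""
    best = max(regions, key=lambda r: cosine(audio_embedding, r["embedding"]))
    return best["box"]


if __name__ == "__main__":
    print(ground_via_text("dog", REGIONS))          # correct transcript
    print(ground_via_text("dock", REGIONS))         # mis-transcribed word
    print(ground_direct([0.2, 0.8, 0.1], REGIONS))  # toy audio embedding
```

In this toy setup, a single mis-transcribed word leaves the text route with no match, while the direct route still returns the nearest region in the shared embedding space. That only illustrates the kind of failure mode the abstract questions; it is not a result reported in the paper.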
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI
