You Only Speak Once to See
Wenhao Yang, Jianguo Wei, Wenhuan Lu, Lei Li

TL;DR
This paper introduces YOSS, a novel method that uses audio cues to improve object grounding in images, demonstrating the potential of audio-visual integration for enhanced scene understanding.
Contribution
YOSS is the first approach to leverage audio for object grounding by integrating pre-trained audio and visual models with contrastive learning.
Findings
Audio guidance can be applied effectively to object grounding.
Incorporating audio guidance may improve the precision and robustness of current object grounding methods.
Potential applications include robotic systems and computer vision applications.
Abstract
Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities…
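The abstract names contrastive learning and multi-modal alignment but does not give the loss. As a rough illustration only, the sketch below uses a CLIP-style symmetric InfoNCE objective over paired audio and visual embeddings, plus a nearest-region lookup at inference time. This is an assumption about the general technique, not the authors' implementation: the function names, the temperature of 0.07, and the toy embeddings are all hypothetical, and real embeddings would come from the pre-trained audio and visual encoders.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(audio, visual, temperature=0.07):
    # Cosine similarity between every audio and visual embedding,
    # divided by a temperature (0.07 is an assumed, commonly used value).
    a = [l2_normalize(v) for v in audio]
    b = [l2_normalize(v) for v in visual]
    return [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
            for ai in a]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th audio clip is paired with the i-th visual embedding.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio, visual, temperature=0.07):
    # Average of the audio-to-visual and visual-to-audio losses.
    logits = similarity_matrix(audio, visual, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

def ground(audio_vec, region_vecs):
    # Hypothetical inference step: pick the candidate region whose
    # embedding is most similar to the spoken query's embedding.
    scores = similarity_matrix([audio_vec], region_vecs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings standing in for encoder outputs.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_visual = [[1.0, 0.0], [0.0, 1.0]]
shuffled_visual = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(audio_batch, aligned_visual)
shuffled_loss = symmetric_contrastive_loss(audio_batch, shuffled_visual)
best_region = ground([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]])
```

In this toy setup the correctly paired batch gives a much lower loss than the shuffled one, which is the behavior a contrastive objective is meant to reward. `ground` then selects the region embedding closest to the query.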
Taxonomy
Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing
Methods: Contrastive Learning
