Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries
Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto

TL;DR
This paper introduces a novel multimodal framework that leverages language and vision models to retrieve high-quality sound effects from video frames, outperforming existing methods and generalizing well across data qualities.
Contribution
The authors propose an automatic data curation pipeline using large language and vision models, combined with contrastive learning, to improve high-quality SFX retrieval from visual queries.
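The summary above does not state the exact training objective, so the following is only a minimal sketch of how contrastive learning is commonly used for cross-modal retrieval. It assumes a symmetric InfoNCE-style loss over paired frame and audio embeddings, followed by cosine-similarity ranking. All function names, the temperature value of 0.07, and the toy embeddings are hypothetical illustrations, not the authors' implementation.

```python
import math


def l2_normalize(v):
    # Scale a (non-zero) vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(frames, audios):
    # Cosine similarity between every frame embedding and every audio embedding.
    f = [l2_normalize(v) for v in frames]
    a = [l2_normalize(v) for v in audios]
    return [[sum(x * y for x, y in zip(fv, av)) for av in a] for fv in f]


def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i
    # (the i-th frame is paired with the i-th audio clip).
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(frames, audios, temperature=0.07):
    # Assumed objective: average of frame-to-audio and audio-to-frame losses.
    # The temperature is a hypothetical default, not taken from the paper.
    sims = similarity_matrix(frames, audios)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def retrieve(frame, audio_library, k=3):
    # Rank library clips by cosine similarity to one query frame; return top-k indices.
    sims = similarity_matrix([frame], audio_library)[0]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    audios = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(frames, audios))
    print(retrieve([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]], k=2))
```

In this sketch, correctly paired embeddings give a lower loss than mismatched ones, and retrieval at query time reduces to a nearest-neighbor search in the shared embedding space.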
Findings
Outperforms baselines on HQ SFX retrieval task
Generalizes well from clean to in-the-wild data
User study shows 67% preference for system-retrieved SFX
Abstract
Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
