Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

Julia Wilkins; Justin Salamon; Magdalena Fuentes; Juan Pablo Bello; Oriol Nieto

arXiv:2308.09089·cs.SD·August 21, 2023

TL;DR

This paper introduces a novel multimodal framework that leverages language and vision models to retrieve high-quality sound effects from video frames, outperforming existing methods and generalizing well across data qualities.

Contribution

The authors propose an automatic data curation pipeline using large language and vision models, combined with contrastive learning, to improve high-quality SFX retrieval from visual queries.
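The summary above names contrastive learning but does not say which loss or encoders are used. As a minimal sketch only, assuming a symmetric InfoNCE-style objective over paired frame and sound embeddings, the idea can be illustrated as follows. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(frames, sounds):
    """Entry [i][j] is the cosine similarity of frame i and sound j."""
    f = [l2_normalize(v) for v in frames]
    s = [l2_normalize(v) for v in sounds]
    return [[sum(a * b for a, b in zip(fi, sj)) for sj in s] for fi in f]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(frames, sounds, temperature=0.07):
    """Assumed objective: average of frame-to-sound and sound-to-frame losses.

    frames[i] and sounds[i] are treated as a matching pair; every other
    combination in the batch acts as a negative.
    """
    sim = cosine_similarity_matrix(frames, sounds)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def rank_sounds(frame, sounds):
    """Return sound indices sorted from most to least similar to one frame."""
    sims = cosine_similarity_matrix([frame], sounds)[0]
    return sorted(range(len(sounds)), key=lambda j: sims[j], reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(frames, aligned))
    print(symmetric_contrastive_loss(frames, shuffled))
    print(rank_sounds([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
```

In this sketch the loss is small when each frame embedding is closest to its own paired sound and large when the pairs are shuffled. At query time, `rank_sounds` orders candidate sound effects by cosine similarity to a single frame embedding, which is one plausible way to serve a visual query once such a shared embedding space exists.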

Findings

01 Outperforms baselines on HQ SFX retrieval task

02 Generalizes well from clean to in-the-wild data

03 User study shows 67% preference for system-retrieved SFX

Abstract

Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such, it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies