FoleyBench: A Benchmark For Video-to-Audio Models
Satvik Dixit, Koichi Saito, Zhi Zhong, Yuki Mitsufuji, Chris Donahue

TL;DR
FoleyBench is a large-scale benchmark for evaluating video-to-audio (V2A) models on Foley sound generation. It addresses the mismatch between existing evaluation datasets and Foley use cases, so that audio-visual alignment and sound quality can be assessed more accurately.
Contribution
This paper introduces FoleyBench, the first dedicated benchmark for Foley-style video-to-audio generation, with a diverse dataset of 5,000 (video, ground-truth audio, text caption) triplets and comprehensive evaluation metrics.
Findings
In past evaluation datasets, 74% of videos have poor audio-visual correspondence, and the content is dominated by speech and music, which lie outside the Foley use case.
FoleyBench covers a wider range of Foley sound categories.
State-of-the-art models show room for improvement in audio-visual alignment.
Abstract
Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each…
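As a rough illustration of the triplet structure mentioned in the abstract, the sketch below models one benchmark entry as a Python dataclass. The class name `FoleyTriplet`, its field names, the helper `has_all_fields`, and the example paths are all hypothetical; the abstract does not describe the file layout of the FoleyBench release, so this shows only one way such an entry might be represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoleyTriplet:
    """One hypothetical (video, ground-truth audio, text caption) entry."""
    video_path: str   # assumed: path to the video clip
    audio_path: str   # assumed: path to the ground-truth audio
    caption: str      # text caption describing the sound event


def has_all_fields(item: FoleyTriplet) -> bool:
    """Return True if no field of the triplet is empty or whitespace-only."""
    return all(s.strip() for s in (item.video_path, item.audio_path, item.caption))


# Illustrative values only; these are not real FoleyBench entries.
example = FoleyTriplet(
    video_path="clips/0001.mp4",
    audio_path="clips/0001.wav",
    caption="A hammer strikes a nail.",
)
```

A check such as `has_all_fields(example)` could be run over a loaded dataset to confirm that every entry carries all three components before evaluation.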
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
