FoleyBench: A Benchmark For Video-to-Audio Models

Satvik Dixit; Koichi Saito; Zhi Zhong; Yuki Mitsufuji; Chris Donahue

arXiv:2511.13219 · cs.SD · November 25, 2025

TL;DR

FoleyBench is a large-scale benchmark for evaluating video-to-audio models on Foley-style sound generation. It addresses the mismatch between existing evaluation datasets and Foley use cases, enabling more accurate assessment of audio-visual alignment and sound quality.

Contribution

This paper introduces FoleyBench, the first dedicated benchmark for Foley-style video-to-audio generation, with a large, diverse dataset and comprehensive evaluation metrics.

Findings

01. 74% of videos from past evaluation datasets have poor audio-visual correspondence, and those datasets are dominated by speech and music, which lie outside the Foley use case.

02. FoleyBench covers a wider range of Foley sound categories.

03. State-of-the-art models show room for improvement in audio-visual alignment.

Abstract

Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each…

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization