DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation

Fu Li; Weichao Zhao; You Li; Zhichao Zhou; Dongliang He

arXiv:2512.06022·cs.SD·December 9, 2025

TL;DR

DreamFoley introduces a scalable autoregressive model leveraging large vision-language models to generate high-fidelity, synchronized video-to-audio content, addressing previous limitations in audio-visual alignment and quality.

Contribution

The paper presents a novel autoregressive architecture with dual-visual encoders, a residual vector quantization tokenizer, and classifier-free guidance, advancing scalable high-quality video-to-audio generation.
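Classifier-free guidance, one of the components named above, is often applied in autoregressive token models by mixing two sets of next-token logits: one computed with the conditioning (here, video and text) and one computed without it. The sketch below illustrates that general recipe only. This page does not say how DreamFoley applies guidance, so the function names (`cfg_logits`, `softmax`), the choice to guide at the logit level, and the toy numbers are all assumptions made for illustration.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance on next-token logits (assumed form).

    scale = 0 returns the unconditional logits, scale = 1 returns the
    conditional logits, and scale > 1 extrapolates past the conditional ones.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy two-token vocabulary (hypothetical numbers, not from the paper).
cond = [2.0, 0.0]    # logits with video/text conditioning
uncond = [1.0, 0.0]  # logits with the conditioning dropped
guided = cfg_logits(cond, uncond, scale=3.0)
probs = softmax(guided)
```

With a scale above 1, the token favored by the conditioning gains probability relative to plain conditional sampling, which is the usual reason for using guidance at inference time.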

Findings

01

Achieves promising performance on benchmark datasets.

02

Effectively balances training efficiency and audio quality.

03

Provides a new dataset with missing audio-visual textual descriptions.

Abstract

Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines the immersive experience and restricts practical applications of these technologies. To address this challenge, several pioneering works have explored diffusion transformer architectures for generating plausible video-synchronized audio, including Kling-Foley, HunyuanVideo-Foley, and ThinkSound. Distinct from existing works, we introduce an autoregressive audio generation architecture (DreamFoley) that harnesses the capabilities of large vision-language models (VLMs) to jointly model sequential interactions among video, audio, and text modalities. Our approach features a dual-visual encoder module that effectively captures both audio-aligned and text-aligned visual features. Additionally, we employ a Residual Vector Quantization…
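Residual vector quantization, named where the abstract excerpt breaks off, generally encodes a vector in stages: each stage picks the nearest entry in its own codebook and passes the leftover residual to the next stage. The following is a minimal sketch of that general idea, not the paper's tokenizer. The function names (`nearest`, `rvq_encode`, `rvq_decode`), the two tiny codebooks, and the input vector are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Return one code index per stage; each stage quantizes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage only sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage setup: a coarse codebook followed by a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]],
]
codes = rvq_encode([1.0, 0.5], codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

In this toy case the first stage captures the coarse component and the second stage captures the remainder, so the reconstruction matches the input exactly. Real tokenizers learn their codebooks and usually leave a small residual error.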

Code & Models

Datasets

Zhaowc/DreamFoley
dataset · 8 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music and Audio Processing