TL;DR
HunyuanVideo-Foley is an end-to-end multimodal diffusion framework that generates high-fidelity, synchronized Foley audio from video and text, overcoming data scarcity and modality imbalance.
Contribution
It introduces a scalable data pipeline, a representation alignment strategy, and a multimodal diffusion transformer for improved audio-visual generation.
Findings
Achieves state-of-the-art audio fidelity and audio-visual alignment
Demonstrates stable and high-quality audio generation
Outperforms existing methods in temporal and semantic matching
Abstract
Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream…
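As a rough illustration of innovation (2), a representation alignment term can be sketched as an auxiliary loss that pulls the diffusion model's intermediate features toward features from a frozen self-supervised audio encoder. The sketch below is an assumption-laden toy, not the paper's implementation: the function names, the choice of cosine similarity, the per-frame averaging, and the `align_weight` value are all hypothetical, and the abstract excerpt does not specify any of them.

```python
import math

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def alignment_loss(projected_hidden, ssl_features):
    """Mean (1 - cosine similarity) over time frames.

    projected_hidden: per-frame hidden states of the diffusion model,
        already projected to the encoder's feature size (hypothetical).
    ssl_features: per-frame features from a frozen self-supervised
        audio encoder (hypothetical stand-in for the paper's encoder).
    """
    if len(projected_hidden) != len(ssl_features):
        raise ValueError("both sequences need the same number of frames")
    sims = [cosine_similarity(h, f)
            for h, f in zip(projected_hidden, ssl_features)]
    return 1.0 - sum(sims) / len(sims)

def total_loss(diffusion_loss, projected_hidden, ssl_features,
               align_weight=0.5):
    """Diffusion objective plus the weighted alignment term.

    align_weight is an illustrative value, not one taken from the paper.
    """
    return diffusion_loss + align_weight * alignment_loss(
        projected_hidden, ssl_features)
```

In this toy form the alignment term is near zero when each hidden frame points in the same direction as its encoder feature, and it grows to 1 when the two are orthogonal, so it only adds a penalty where the latent representation drifts away from the self-supervised features.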
