HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Sizhe Shan; Qiulin Li; Yutao Cui; Miles Yang; Yuehai Wang; Qun Yang; Jin Zhou; Zhao Zhong

arXiv:2508.16930·eess.AS·August 26, 2025

TL;DR

HunyuanVideo-Foley is an end-to-end multimodal diffusion framework that generates high-fidelity, synchronized Foley audio from video and text, overcoming data scarcity and modality imbalance.

Contribution

It introduces a scalable data pipeline, a representation alignment strategy, and a multimodal diffusion transformer for improved audio-visual generation.

Findings

1. Achieves state-of-the-art audio fidelity and alignment
2. Demonstrates stable and high-quality audio generation
3. Outperforms existing methods in temporal and semantic matching

Abstract

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream…
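The abstract states that self-supervised audio features guide latent diffusion training but does not give the loss itself. The sketch below shows one common way such guidance is written, assumed here for illustration only: a frame-wise negative cosine similarity between the diffusion model's hidden states and features from a frozen self-supervised audio encoder, added to the diffusion loss with a weight. The names `alignment_loss`, `total_loss`, and `weight` are hypothetical, and the hidden states are assumed to be already projected to the same dimension as the target features.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(hidden_frames, target_frames):
    """Mean negative cosine similarity over time frames.

    hidden_frames: per-frame hidden states from the diffusion model
                   (assumed already projected to the target dimension).
    target_frames: per-frame features from a frozen self-supervised
                   audio encoder (assumed, not specified in the abstract).
    """
    if len(hidden_frames) != len(target_frames):
        raise ValueError("frame counts must match")
    if not hidden_frames:
        raise ValueError("at least one frame is required")
    sims = [cosine_similarity(h, t) for h, t in zip(hidden_frames, target_frames)]
    return -sum(sims) / len(sims)


def total_loss(diffusion_loss, hidden_frames, target_frames, weight=0.5):
    """Diffusion loss plus a weighted alignment term (weight is a guess)."""
    return diffusion_loss + weight * alignment_loss(hidden_frames, target_frames)


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 2.0]]
    target = [[2.0, 0.0], [0.0, 1.0]]
    print(alignment_loss(hidden, target))
    print(total_loss(1.0, hidden, target))
```

The alignment term is lowest when every hidden frame points in the same direction as its target feature, so minimizing it pulls the model's internal representation toward the pretrained audio features without constraining their magnitude.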
