Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Jun Wang; Xijuan Zeng; Chunyu Qiang; Ruilong Chen; Shiyao Wang; Le Wang; Wangjing Zhou; Pengfei Cai; Jiahui Zhao; Nan Li; Zihan Li; Yuzhe Liang; Xiaopeng Wang; Haorui Zheng; Ming Wen; Kang Yin; Yiran Wang; Nan Li; Feng Deng; Liang Dong; Chen Zhang; Di Zhang; Kun Gai

arXiv:2506.19774 · eess.AS · June 25, 2025

TL;DR

Kling-Foley is a multimodal diffusion transformer that generates high-quality audio synchronized with video. It uses a visual semantic representation module and an audio-visual synchronization module to improve alignment, together with a universal latent audio codec. The authors also provide a new benchmark for evaluation.

Contribution

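The paper introduces Kling-Foley, a video-to-audio generation model that combines multimodal diffusion transformers with a visual semantic representation module, an audio-visual synchronization module, and a universal latent audio codec. The authors report new state-of-the-art results in video-to-audio generation.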

Findings

01. Achieves state-of-the-art performance in audio-visual synchronization.

02. Effectively models diverse audio types including speech, music, and sound effects.

03. Provides a new benchmark for evaluating video-to-audio models.

Abstract

We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine them with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering…
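To make the phrase "align video conditions with latent audio elements at the frame level" concrete, here is a minimal sketch of one simple way such an alignment could work: repeating per-frame video features so that each audio latent step has a matching video feature. This is an illustration under stated assumptions, not the paper's method. The function name `align_video_to_latents`, the parameters, and the frame and latent rates are all hypothetical and do not come from the abstract.

```python
def align_video_to_latents(video_feats, video_fps, latent_rate, n_latents):
    """Return one video feature per audio latent step (nearest earlier frame).

    Hypothetical helper: names and rates are assumptions for illustration,
    not values or modules described in the paper.
    """
    if not video_feats:
        raise ValueError("video_feats must contain at least one frame")
    last = len(video_feats) - 1
    aligned = []
    for i in range(n_latents):
        # Latent step i starts at i / latent_rate seconds; the video frame
        # covering that instant has index floor(i * video_fps / latent_rate).
        # Clamp so trailing latent steps reuse the final frame.
        frame_idx = min((i * video_fps) // latent_rate, last)
        aligned.append(video_feats[frame_idx])
    return aligned


# Two frames of a 2 fps clip, aligned to 4 latent steps per second.
frames = ["f0", "f1"]
print(align_video_to_latents(frames, video_fps=2, latent_rate=4, n_latents=4))
# prints ['f0', 'f0', 'f1', 'f1']
```

In practice the features would be vectors rather than strings, and a learned module could replace the nearest-frame lookup. The sketch only shows the time-index bookkeeping that any frame-level conditioning scheme needs.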

Code & Models

Datasets

klingfoley/Kling-Audio-Eval
dataset · 3.5k downloads

Taxonomy

Topics: Speech and Audio Processing · Multimedia Communication and Technology

Methods: Diffusion · ALIGN