Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

Yongqi Wang; Wenxiang Guo; Rongjie Huang; Jiawei Huang; Zehan Wang; Fuming You; Ruiqi Li; Zhou Zhao

arXiv:2406.00320·cs.SD·January 7, 2025

TL;DR

Frieren is a video-to-audio generation model built on rectified flow matching. It synthesizes high-quality, temporally synchronized audio in few sampling steps and outperforms earlier autoregressive and diffusion-based methods.

Contribution

The paper introduces Frieren, a rectified flow matching-based V2A model that improves audio quality, synchronization, and efficiency over existing autoregressive and diffusion models.

Findings

01

Achieves state-of-the-art audio quality and synchronization on the VGGSound dataset.

02

Reaches 97.22% alignment accuracy.

03

Improves inception score by 6.2% over diffusion baselines.

Abstract

Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to the spectrogram latent with straight paths and conducts sampling by solving an ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few, or even only…
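
The abstract's core mechanism (regress a vector field along straight noise-to-data paths, then sample by solving an ODE) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common rectified-flow interpolation x_t = (1 - t) x_0 + t x_1 with target velocity x_1 - x_0, leaves out the video conditioning and the transformer estimator, and replaces the learned network with a closed-form toy field. All names here (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, `toy_field`, `TARGET`) are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target on a straight path: constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict, x0, x1, t):
    """Mean squared error between the predicted and the target velocity."""
    xt = interpolate(x0, x1, t)
    v_pred = predict(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

def euler_sample(predict, x0, num_steps):
    """Integrate dx/dt = predict(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = predict(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for a trained estimator (assumption): when the data is a
# single point TARGET, the ideal field is (TARGET - x) / (1 - t).
TARGET = [0.5, -1.0, 2.0]

def toy_field(x, t):
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, TARGET)]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in TARGET]
one_step = euler_sample(toy_field, noise, num_steps=1)
many_steps = euler_sample(toy_field, noise, num_steps=25)
```

In this toy setting the trajectories are exactly straight, so a single Euler step lands on the same point as 25 steps. That is the intuition behind few-step sampling: the straighter the learned paths, the less accuracy is lost when the step count drops.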

Code & Models

Repositories

cyanbx/Frieren-V2A (PyTorch, official)

Videos

Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Video Analysis and Summarization