Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control

Bingliang Li; Fengyu Yang; Yuxin Mao; Qingwen Ye; Hongkai Chen; Yiran Zhong

arXiv:2412.20378·cs.CV·December 31, 2024

Open Access

TL;DR

Tri-Ergon is a diffusion-based video-to-audio model that integrates multi-modal prompts and LUFS control to generate high-fidelity, fine-grained stereo audio aligned with video content, surpassing previous methods in quality and control.

Contribution

We introduce Tri-Ergon, a novel V2A model that combines multi-modal prompts and LUFS embedding for detailed, controllable audio synthesis from video.

Findings

1. Generates 44.1 kHz stereo audio up to 60 seconds.

2. Outperforms existing V2A models in audio quality and control.

3. Enables precise loudness adjustments over time.

Abstract

Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of…
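To make the idea of a time-varying, per-channel loudness signal concrete, here is a minimal Python sketch. It is not the paper's LUFS embedding, which this excerpt does not describe. It only computes a windowed loudness curve per stereo channel using the ITU-R BS.1770 loudness formula, and it deliberately omits the K-weighting filter and gating that true LUFS requires, so the values are approximations. The function names, the 1-second window, and the -70 dB floor are assumptions chosen for illustration.

```python
import math


def loudness_curve(samples, sample_rate, window_s=1.0, floor_db=-70.0):
    """Approximate loudness in dB for each non-overlapping window of one channel.

    Assumption: samples are floats in [-1.0, 1.0]. K-weighting and gating
    from ITU-R BS.1770 are omitted, so this is not true LUFS.
    """
    win = max(1, int(sample_rate * window_s))
    curve = []
    for start in range(0, len(samples), win):
        block = samples[start:start + win]
        mean_square = sum(x * x for x in block) / len(block)
        if mean_square <= 0.0:
            # Digital silence: clamp instead of taking log10(0).
            curve.append(floor_db)
        else:
            level = -0.691 + 10.0 * math.log10(mean_square)
            curve.append(max(floor_db, level))
    return curve


def stereo_loudness_curves(left, right, sample_rate, window_s=1.0):
    """Return one loudness curve per channel (hypothetical helper)."""
    return {
        "left": loudness_curve(left, sample_rate, window_s),
        "right": loudness_curve(right, sample_rate, window_s),
    }


if __name__ == "__main__":
    sr = 44100
    # One second of a full-scale 1 kHz tone, then one second at half amplitude.
    tone = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(sr)]
    left = tone + [0.5 * x for x in tone]
    right = [0.0] * (2 * sr)
    print(stereo_loudness_curves(left, right, sr))
```

A curve like this, one value per window and per channel, is the kind of signal a user could edit by hand to request louder or quieter passages. How Tri-Ergon encodes and conditions on such a curve is specified in the full paper, not here.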

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies