Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control
Bingliang Li, Fengyu Yang, Yuxin Mao, Qingwen Ye, Hongkai Chen, Yiran, Zhong

TL;DR
Tri-Ergon is a diffusion-based video-to-audio model that integrates multi-modal prompts and LUFS control to generate high-fidelity, fine-grained stereo audio aligned with video content, surpassing previous methods in quality and control.
Contribution
We introduce Tri-Ergon, a novel V2A model that combines multi-modal prompts and LUFS embedding for detailed, controllable audio synthesis from video.
Findings
Generates 44.1 kHz stereo audio up to 60 seconds.
Outperforms existing V2A models in audio quality and control.
Enables precise loudness adjustments over time.
Abstract
Video-to-audio (V2A) generation utilizes visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce Loudness Units relative to Full Scale (LUFS) embedding, which allows for precise manual control of the loudness changes over time for individual audio channels, enabling our model to effectively address the intricate correlation of video and audio in real-world Foley workflows. Tri-Ergon is capable of creating 44.1 kHz high-fidelity stereo audio clips of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
