Omni2Sound: Towards Unified Video-Text-to-Audio Generation
Yusheng Dai, Zehua Chen, Yuxuan Jiang, Baolong Gao, Qiuhong Ke, Jianfei Cai, Jun Zhu

TL;DR
This paper introduces Omni2Sound, a unified model for video-text-to-audio generation, together with SoundAtlas, a large-scale dataset built to support it. The two address data scarcity and modality bias, and the model achieves state-of-the-art results across tasks.
Contribution
The paper presents a novel unified diffusion model and a large-scale dataset to improve video-text-to-audio generation and address cross-task competition and data limitations.
Findings
SoundAtlas (470k pairs) is reported to outperform existing benchmarks, and even human experts, in quality.
Omni2Sound achieves state-of-the-art performance across video-to-audio, text-to-audio, and joint tasks.
The proposed training schedule effectively balances multiple modalities and tasks.
Abstract
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5× cost reduction, and rigorous Post-hoc Filtering to…
