Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Yusheng Dai; Zehua Chen; Yuxuan Jiang; Baolong Gao; Qiuhong Ke; Jianfei Cai; Jun Zhu

arXiv:2601.02731·cs.SD·April 30, 2026

TL;DR

This paper introduces Omni2Sound, a unified model for video-to-audio, text-to-audio, and joint video-text-to-audio generation. It is supported by SoundAtlas, a large-scale dataset built to address data scarcity; the model is designed to reduce modality bias and reports state-of-the-art results across all three tasks.

Contribution

The paper contributes a unified diffusion model (Omni2Sound) and a large-scale dataset (SoundAtlas) for video-text-to-audio generation, targeting cross-task competition and data limitations.

Findings

01

SoundAtlas (470k pairs) is reported to surpass existing datasets, and even human experts, in quality.

02

Omni2Sound achieves state-of-the-art performance across video-to-audio, text-to-audio, and joint tasks.

03

The proposed training schedule effectively balances multiple modalities and tasks.

Abstract

Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5$\times$ cost reduction, and rigorous Post-hoc Filtering to…

Code & Models

Models

🤗 Dalision/Omni2Sound (model; 78 downloads, 5 likes)
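
For readers who want to inspect the released files, the following is a minimal sketch of fetching the model repository listed above with the `huggingface_hub` client. It assumes that `huggingface_hub` is installed and that `Dalision/Omni2Sound` is publicly accessible; the helper name `fetch_checkpoint` is hypothetical. The page does not say what the repository contains or how to run inference, so the sketch stops at downloading.

```python
# Repository id as listed on this page.
REPO_ID = "Dalision/Omni2Sound"


def fetch_checkpoint(local_dir=None):
    """Download a snapshot of the model repository and return its local path.

    Hypothetical helper: assumes the repository is public and that the
    `huggingface_hub` package is installed. The import is deferred so the
    module can be loaded without the package present.
    """
    from huggingface_hub import snapshot_download

    # With local_dir=None the files go to the default Hugging Face cache.
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example (requires network access):
#   path = fetch_checkpoint()
#   print(path)
```

The returned path points at a local copy of the repository; loading and sampling from the model would follow whatever instructions the authors ship alongside the weights.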
