OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Weiguo Pian; Saksham Singh Kushwaha; Zhimin Chen; Shijian Deng; Kai Wang; Yunhui Guo; Yapeng Tian

arXiv:2604.04348·cs.SD·April 7, 2026

TL;DR

OmniSonic is a diffusion-based framework conditioned on video and text that generates complete auditory scenes containing on-screen sounds, off-screen sounds, and human speech. Prior video-conditioned models neglect off-screen sounds, and prior holistic text-video-to-audio models cannot generate speech.

Contribution

The paper presents OmniSonic, a novel flow-matching diffusion model with a TriAttn-DiT architecture and MoE gating, enabling universal holistic audio generation from video and text.
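
This page does not state the exact training objective, so the following is only a minimal sketch of the commonly used linear-path flow-matching target, not the authors' implementation. The names `flow_matching_pair` and `mse` are hypothetical. Video and text conditioning, the TriAttn-DiT network, and the MoE gating are all omitted, and the vectors stand in for whatever audio representation the model actually uses.

```python
import random


def flow_matching_pair(x1, rng):
    """Build one training pair for a linear-path flow-matching objective.

    x1 is a clean target vector (assumed here to stand in for an audio latent).
    Returns the interpolated point x_t, the sampled time t, and the target
    velocity that a network would be trained to predict at (x_t, t).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]            # Gaussian noise sample
    t = rng.random()                                   # time in [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on straight path
    v_target = [b - a for a, b in zip(x0, x1)]         # constant velocity x1 - x0
    return x_t, t, v_target


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [1.0, -2.0, 0.5]
x_t, t, v_target = flow_matching_pair(x1, rng)

# A real model would predict the velocity from (x_t, t, video, text);
# an all-zero prediction is used here only as a placeholder.
loss = mse([0.0] * len(x1), v_target)
```

In this formulation, following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on `x1`, which is why the network can be used to integrate from noise to a sample at inference time.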

Findings

01

OmniSonic outperforms existing models on objective metrics.

02

The model effectively generates on-screen, off-screen, and speech sounds simultaneously.

03

Extensive experiments validate OmniSonic as a strong baseline for universal audio generation.

Abstract

In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, but they are limited to non-speech sounds and lack the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention…

Code & Models

Repositories

https://weiguopian.github.io/OmniSonic_webpage
