TL;DR
OmniSonic is a flow-matching-based diffusion framework, conditioned jointly on video and text, that generates complete auditory scenes covering on-screen sounds, off-screen sounds, and human speech, which prior models handle only in part.
Contribution
The paper presents OmniSonic, a novel flow-matching diffusion model with a TriAttn-DiT architecture and MoE gating, enabling universal holistic audio generation from video and text.
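The gated fusion idea named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary does not say which conditions the three cross-attentions attend to or how the gate is parameterized, so the three condition streams, the function names (`cross_attention`, `gated_tri_attention`), the single-head projection-free attention, and the per-token softmax gate are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head scaled dot-product cross-attention.
    # Assumption: no learned Q/K/V projections, to keep the sketch short.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)      # (T, S)
    return softmax(scores, axis=-1) @ context      # (T, d)

def gated_tri_attention(latents, contexts, gate_weights):
    # Hypothetical fusion: one cross-attention per condition stream,
    # mixed by a per-token softmax gate (a soft mixture-of-experts).
    outputs = np.stack(
        [cross_attention(latents, c) for c in contexts], axis=1
    )                                                 # (T, 3, d)
    gates = softmax(latents @ gate_weights, axis=-1)  # (T, 3)
    fused = (gates[..., None] * outputs).sum(axis=1)  # (T, d)
    return latents + fused, gates                     # residual connection

rng = np.random.default_rng(0)
d = 16
latents = rng.normal(size=(8, d))            # 8 audio latent tokens
contexts = [                                 # three illustrative condition streams
    rng.normal(size=(5, d)),
    rng.normal(size=(7, d)),
    rng.normal(size=(4, d)),
]
gate_weights = rng.normal(size=(d, 3))       # hypothetical gate projection
out, gates = gated_tri_attention(latents, contexts, gate_weights)
```

Each row of `gates` sums to 1, so every latent token receives a convex combination of the three attention outputs. A trained gate could in principle learn to favor one stream over the others for a given token.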
Findings
OmniSonic outperforms existing models on objective metrics.
The model effectively generates on-screen, off-screen, and speech sounds simultaneously.
Extensive experiments validate OmniSonic as a strong baseline for universal audio generation.
Abstract
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, they are limited to non-speech sounds and cannot generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention…
