OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Yaofeng Su; Yuming Li; Zeyue Xue; Jie Huang; Siming Fu; Haoran Li; Ying Li; Zezhong Qian; Haoyang Huang; and Nan Duan

arXiv:2603.11647 · cs.MM · March 16, 2026

Open Access

TL;DR

OmniForcing is a framework for real-time, high-quality joint audio-visual generation. It distills a bidirectional diffusion model into an efficient autoregressive generator, addressing both the latency of bidirectional attention and the training instability of causal distillation.

Contribution

The paper presents what the authors describe as the first framework to distill an offline, dual-stream bidirectional diffusion model into a streaming autoregressive generator for joint audio-visual generation, with techniques that address training instability and synchronization drift.

Findings

01

Achieves state-of-the-art streaming generation at ~25 FPS on a single GPU.

02

Maintains multi-modal synchronization and visual quality comparable to bidirectional models.

03

Introduces novel alignment and distillation techniques to stabilize training and improve performance.

Abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism…
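The block-causal idea in the abstract can be illustrated with a small attention-mask sketch. This is only an illustration under stated assumptions, not the paper's construction: the function names, the token counts per block (4 video tokens and 1 audio token per temporal block), and the use of a single always-visible audio slot (block id -1) as a stand-in for a prefix or sink token are all hypothetical choices made for this example.

```python
from typing import List


def token_block_ids(num_blocks: int, tokens_per_block: int,
                    prefix_len: int = 0) -> List[int]:
    """Assign a temporal block index to every token.

    Prefix tokens get block id -1 so that every later token may see them.
    (Hypothetical helper, not taken from the paper.)
    """
    ids = [-1] * prefix_len
    for block in range(num_blocks):
        ids.extend([block] * tokens_per_block)
    return ids


def block_causal_mask(query_ids: List[int],
                      key_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when query token i may attend to key token j.

    A key is visible if its block is not later than the query's block.
    Prefix keys (id -1) are therefore visible to every query.
    """
    return [[k <= q for k in key_ids] for q in query_ids]


# Assumed toy layout: 3 temporal blocks, with the video stream much denser
# than the audio stream (the asymmetry the abstract refers to).
video_ids = token_block_ids(num_blocks=3, tokens_per_block=4)
audio_ids = token_block_ids(num_blocks=3, tokens_per_block=1, prefix_len=1)

# Joint sequence: video tokens first, then audio tokens.
joint_ids = video_ids + audio_ids
mask = block_causal_mask(joint_ids, joint_ids)
```

In this toy layout a video token in block 0 can attend to the four video tokens of its own block, the always-visible audio slot, and the audio token of block 0, but to nothing from later blocks. Because each audio block contributes a single token here, an early audio query would see very few keys without the always-visible slot, which is one way to picture the sparsity the abstract describes.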

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis