Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Yupeng Zhou; Lianghua Huang; Zhifan Wu; Jiabao Wang; Yupeng Shi; Biao Jiang; Daquan Zhou; Yu Liu; Ming-Ming Cheng; Qibin Hou

arXiv:2604.25819·cs.CV·April 29, 2026

TL;DR

This paper introduces Mutual Forcing, a framework for fast autoregressive audio-video generation that keeps audio and video synchronized over long horizons, efficiently and at high quality, without relying on a bidirectional teacher.

Contribution

It presents a native causal model that integrates multi-step and few-step generation, which removes the need for complex distillation pipelines and improves training-inference consistency.

Findings

01. Matches or exceeds baseline quality with only 4-8 sampling steps, versus 50 steps for the baselines.

02. Supports flexible sequence lengths and reduces training overhead.

03. Outperforms prior approaches such as Self-Forcing in both efficiency and quality.
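
To put the step counts from the first finding in perspective, the short sketch below computes how many times fewer sampling steps the 4-8 step setting uses than a 50-step baseline. It is only arithmetic on the numbers quoted above: the helper `step_reduction` is a hypothetical name, and reading the ratio as a speedup assumes every sampling step costs about the same and ignores fixed overheads, which the source does not state.

```python
# Hypothetical helper, not from the paper: ratio of baseline sampling steps
# to few-step sampling steps.
def step_reduction(baseline_steps: int, fast_steps: int) -> float:
    if baseline_steps <= 0 or fast_steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_steps / fast_steps


BASELINE_STEPS = 50   # baseline step count quoted in the findings
FAST_STEPS = (4, 8)   # few-step range quoted in the findings

# Assumption: equal cost per step, so the ratio approximates the reduction
# in sampling compute per generated segment.
reductions = {k: step_reduction(BASELINE_STEPS, k) for k in FAST_STEPS}

for k, ratio in reductions.items():
    print(f"{k} steps: {ratio:.2f}x fewer sampling steps than {BASELINE_STEPS}")
```

Under that equal-cost assumption, the few-step setting needs between roughly one sixth and one twelfth of the baseline's sampling steps. Measured wall-clock gains may differ.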

Abstract

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a…
