TL;DR
This paper introduces Mutual Forcing, a novel framework for fast autoregressive audio-video generation that enables efficient, high-quality, long-horizon synchronization without relying on bidirectional teachers.
Contribution
It presents a native causal model with integrated multi-step and few-step generation, eliminating the need for complex distillation pipelines and improving training-inference consistency.
Findings
Achieves comparable or better quality with only 4-8 sampling steps, versus the 50 steps used by baselines.
Supports flexible sequence lengths and reduces training overhead.
Outperforms prior approaches like Self-Forcing in efficiency and quality.
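To put the step counts above in perspective, here is a minimal back-of-the-envelope sketch. It assumes, purely for illustration, that each sampling step costs one denoiser evaluation of roughly equal cost; the helper name `step_reduction` is hypothetical, and the ratios are plain arithmetic on the quoted step counts, not speedups reported by the paper.

```python
def step_reduction(baseline_steps: int, fast_steps: int) -> float:
    """Ratio of denoiser evaluations per generated chunk (baseline / fast).

    Assumption: every sampling step costs the same, so the ratio of
    step counts approximates the reduction in sampling compute.
    """
    if baseline_steps <= 0 or fast_steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_steps / fast_steps


BASELINE_STEPS = 50  # baseline step count quoted in the findings

for fast_steps in (4, 8):  # the two ends of the quoted 4-8 step range
    ratio = step_reduction(BASELINE_STEPS, fast_steps)
    print(f"{fast_steps} steps: {ratio:.2f}x fewer denoiser evaluations")
```

Under that assumption, the 4-8 step range corresponds to between 6.25x and 12.5x fewer denoiser evaluations per chunk. Actual wall-clock gains depend on model size, caching, and decoding overhead, none of which this sketch models.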
Abstract
In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a…
