TL;DR
Rolling Sink is a training-free method that extends autoregressive video diffusion models to ultra-long durations, maintaining visual quality and temporal coherence beyond limited training horizons.
Contribution
Introduces Rolling Sink, a training-free approach that bridges the train-test gap for long-video synthesis, derived from a systematic analysis of AR cache maintenance.
Findings
Enables 5-30 minute video synthesis at 16 FPS with stable quality.
Achieves superior long-horizon visual fidelity and temporal consistency.
Built on Self Forcing, which was trained on only 5s clips.
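To give a sense of scale for the findings above, the short calculation below converts the quoted durations into frame counts and compares them with the 5s training clip. It is only a back-of-the-envelope sketch: the helper names are illustrative, and it assumes the training clips run at the same 16 FPS as the generated videos, which the summary does not state.

```python
FPS = 16            # output frame rate quoted in the findings
TRAIN_CLIP_S = 5    # training clip length in seconds, quoted in the findings


def frames(seconds: float, fps: int = FPS) -> int:
    """Number of frames in a clip of the given duration."""
    return int(round(seconds * fps))


# Assumption: training clips use the same 16 FPS as the generated videos.
train_frames = frames(TRAIN_CLIP_S)


def horizon_ratio(minutes: float) -> float:
    """How many training-clip lengths fit into a video lasting `minutes`."""
    return frames(minutes * 60) / train_frames


if __name__ == "__main__":
    for minutes in (5, 30):
        total = frames(minutes * 60)
        print(f"{minutes} min: {total} frames, "
              f"{horizon_ratio(minutes):.0f}x the training clip")
```

Under these assumptions, a 5-minute video is 4,800 frames and a 30-minute video is 28,800 frames, i.e. 60 to 360 times the 80-frame training clip.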
Abstract
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To this end, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video…
