UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu

TL;DR
UltraViCo is a training-free method that helps video diffusion transformers generalize beyond their training length by suppressing attention dispersion, raising the extrapolation limit from 2x to 4x and improving quality at longer lengths.
Contribution
We introduce UltraViCo, a novel plug-and-play approach that addresses attention dispersion to improve video length extrapolation in diffusion transformers without additional training.
Findings
Outperforms baselines across models and ratios
Increases extrapolation limit from 2x to 4x
Improves quality metrics by over 40% at 4x extrapolation
Abstract
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation, and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of…
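The dilution effect described in the abstract can be illustrated with a toy experiment. The following is a minimal sketch, not the authors' implementation: the window rule (keys at least `train_len` positions away from the query count as out of window), the random logits, the value `alpha=0.1`, and the function names are all assumptions made for illustration. It measures how much attention mass stays on in-window keys as the sequence grows, with and without a constant decay applied to out-of-window keys.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_decay(logits, train_len, alpha):
    """Hypothetical helper: scale out-of-window attention by alpha.

    logits: (n_q, n_k) pre-softmax scores indexed by frame position.
    Assumption: a key is "out of window" when |q - k| >= train_len.
    Adding log(alpha) to a logit multiplies its unnormalized weight
    by alpha before the softmax renormalizes the row.
    """
    n_q, n_k = logits.shape
    q_pos = np.arange(n_q)[:, None]
    k_pos = np.arange(n_k)[None, :]
    out_of_window = np.abs(q_pos - k_pos) >= train_len
    bias = np.where(out_of_window, np.log(alpha), 0.0)
    return softmax(logits + bias), out_of_window

def in_window_mass(weights, out_of_window):
    # Average (over queries) attention mass placed on in-window keys.
    kept = np.where(out_of_window, 0.0, weights)
    return float(kept.sum(axis=-1).mean())

rng = np.random.default_rng(0)
train_len = 16   # toy training length in frames (assumption)
results = {}
for ratio in (1, 2, 4):
    n = train_len * ratio
    logits = rng.normal(size=(n, n))  # random stand-in for real scores
    plain, mask = attention_with_decay(logits, train_len, alpha=1.0)
    decayed, _ = attention_with_decay(logits, train_len, alpha=0.1)
    results[ratio] = (in_window_mass(plain, mask),
                      in_window_mass(decayed, mask))
    print(f"{ratio}x: in-window mass plain={results[ratio][0]:.3f} "
          f"decayed={results[ratio][1]:.3f}")
```

In this toy setting, the in-window mass shrinks as the extrapolation ratio grows, and the decay moves mass back toward in-window keys. Expressing the decay as an additive bias on the logits is a design choice of this sketch; whether it matches the paper's formulation is not stated in this summary.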
Peer Reviews
Decision: ICLR 2026 Poster
1) The paper is very well written; the figures are very helpful for understanding the motivation behind the proposed method.
2) This work provides a compelling analysis that connects two failure modes (repetition and quality degradation) under the concept of attention dispersion, offering a deeper understanding of the underpinnings of video diffusion transformer models.
3) The proposed solution UltraViCo is training-free, simple to implement, and integrates seamlessly with existing popular SOTA video Di
1) UltraViCo relies on manually chosen decay factors (alpha, beta) and harmonic band width (gamma), which differ across models. The paper does not provide a fully automated procedure for selecting these values, raising concerns about generalization to unseen architectures.
* This study follows a clear methodology: observation, analysis, improvement, and validation, with rigorous logic and well-structured presentation.
* Experiments demonstrate the effectiveness of the proposed method on Wan and HunyuanVideo, outperforming baseline approaches.
* The authors’ modifications to the computation process may introduce potential performance issues. Their improvements focus on the attention map, but in efficient attention implementations (e.g., FlashAttention-3), this matrix is not fully materialized. Consequently, the proposed changes could negatively impact computational efficiency.
* Generating long videos is no longer a major challenge, as several publicly available models already support segment-wise long-video generation, such as Wan2.2-
1. The paper unifies two major failure modes under the single concept of attention dispersion. To me, this is a good contribution that challenges the earlier positional-encoding-centric explanations.
2. The approach presented in the paper requires no additional training or model modifications, making it a plug-and-play inference-time fix. Despite its simplicity, it tackles both identified issues simultaneously.
3. The paper gives importance to practical implementation. Integrating
1. UltraViCo down-weights the influence of very distant frames. A possible side effect is that the model might lose some long-term context or consistency. The paper asserts that important content is preserved while irrelevant far-context influence is removed, but it does not rigorously quantify scene consistency over extremely long videos. In one ablation, using too small an $\alpha$ caused a car tire to disappear in later frames, indicating that overly concentrating attention can indeed harm persi
Taxonomy
Topics: Image and Video Quality Assessment · Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies
