UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

Min Zhao; Hongzhou Zhu; Yingze Wang; Bokai Yan; Jintao Zhang; Guande He; Ling Yang; Chongxuan Li; Jun Zhu

arXiv:2511.20123 · cs.CV · March 3, 2026

Open Access · 3 Reviews

TL;DR

UltraViCo is a training-free method that enhances video diffusion transformers' ability to generalize beyond training length by suppressing attention dispersion, significantly improving extrapolation performance and quality.

Contribution

We introduce UltraViCo, a novel plug-and-play approach that addresses attention dispersion to improve video length extrapolation in diffusion transformers without additional training.

Findings

01. Outperforms baselines across models and ratios
02. Increases extrapolation limit from 2x to 4x
03. Improves quality metrics by over 40% at 4x extrapolation

Abstract

Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation, and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of…
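The dilution described in the abstract can be illustrated with a toy softmax. This is a minimal sketch and not the paper's analysis: it assumes one query, a handful of "relevant" keys with a fixed high logit, and all other keys at logit zero. The names `focused_mass`, `num_relevant`, `relevant_logit`, and `train_len` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of attention logits.
    shifted = logits - logits.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

def focused_mass(num_keys, num_relevant=8, relevant_logit=4.0):
    # Toy setup (assumption): the first `num_relevant` keys carry the
    # pattern the query learned to attend to; every other key has logit 0.
    logits = np.zeros(num_keys)
    logits[:num_relevant] = relevant_logit
    weights = softmax(logits)
    # Share of total attention that still lands on the relevant keys.
    return weights[:num_relevant].sum()

train_len = 64  # hypothetical training-window size in tokens
for ratio in (1, 2, 3, 4):
    n = train_len * ratio
    print(f"{ratio}x ({n} keys): mass on relevant keys = {focused_mass(n):.3f}")
```

Even though the relevant logits are unchanged, the share of attention they receive shrinks as more keys join the softmax denominator, which is one simple way to picture dispersion under length extrapolation.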

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1) The paper is very well written; the figures are very helpful to understand the motivation behind the proposed method.
2) This work provides a compelling analysis that connects two failure modes (repetition and quality degradation) under the concept of attention dispersion, offering a deeper understanding of underpinnings of video diffusion transformer models.
3) The proposed solution UltraViCo is training-free, simple to implement, and integrates seamlessly with existing popular SOTA video Di

Weaknesses

1) UltraViCo relies on manually chosen decay factors (alpha, beta) and harmonic band width (gamma), which differ across models. The paper does not provide a fully automated procedure for selecting these values, raising concerns about generalization to unseen architectures.

Reviewer 02 · Rating 2 · Confidence 5

Strengths

* This study follows a clear methodology: observation, analysis, improvement, and validation, with rigorous logic and well-structured presentation.
* Experiments demonstrate the effectiveness of the proposed method on Wan and HunyuanVideo, outperforming baseline approaches.

Weaknesses

* The authors’ modifications to the computation process may introduce potential performance issues. Their improvements focus on the attention map, but in efficient attention implementations (e.g., FlashAttention-3), this matrix is not fully materialized. Consequently, the proposed changes could negatively impact computational efficiency.
* Generating long videos is no longer a major challenge, as several publicly available models already support segment-wise long-video generation—such as Wan2.2-
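On the first concern above, one general identity is worth keeping in mind: scaling post-softmax attention weights by a positive factor and renormalizing is the same as adding the log of that factor to the logits before the softmax. The sketch below checks this numerically. It does not claim that UltraViCo's operation takes this form, since this page does not give the formula, and whether a particular fused kernel accepts such a bias has to be checked against that kernel. The decay value 0.5 and the split at position 8 are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of attention logits.
    shifted = logits - logits.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

def decay_after_softmax(logits, decay):
    # Needs the materialized attention row: scale it, then renormalize.
    weights = softmax(logits) * decay
    return weights / weights.sum()

def decay_as_logit_bias(logits, decay):
    # Equivalent form: add log(decay) to the logits before the softmax,
    # so no post-hoc edit of the attention row is required.
    return softmax(logits + np.log(decay))

rng = np.random.default_rng(0)
logits = rng.normal(size=12)

# Hypothetical setup: keys 0..7 are inside the training window,
# keys 8..11 lie beyond it and are down-weighted by 0.5.
decay = np.ones(12)
decay[8:] = 0.5

a = decay_after_softmax(logits, decay)
b = decay_as_logit_bias(logits, decay)
print("max abs difference:", np.abs(a - b).max())
```

If a down-weighting scheme can be written this way, it reduces to an additive per-key bias, which is the kind of input that attention implementations with additive-mask support can consume without exposing the full attention matrix.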

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper provides a unification of two major failure modes under the single concept of attention dispersion. To me, this is a good contribution contending the earlier positional encoding-centric explanations.
2. The approach presented in the paper requires no additional training or model modifications, making it a plug-and-play inference-time fix. Despite its simplicity, it tackles both identified issues simultaneously.
3. The paper gives importance to practical implementation. Integrating

Weaknesses

1. UltraViCo down-weights the influence of very distant frames. A possible side effect is that the model might lose some long-term context or consistency. The paper asserts that important content is preserved while removing irrelevant far-context influence but it does not rigorously quantify scene consistency over extremely long videos. In one ablation, using too small an $\alpha$ caused a car tire to disappear in later frames, indicating that overly concentrating attention can indeed harm persi

Taxonomy

Topics: Image and Video Quality Assessment · Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies