Enhance-A-Video: Better Generated Video for Free
Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You

TL;DR
Enhance-A-Video is a training-free method that improves the temporal consistency and visual quality of DiT-based generated videos by enhancing cross-frame correlations.
Contribution
We introduce a simple, training-free enhancement technique for DiT-based video generation that boosts temporal coherence and visual quality.
Findings
Improves temporal consistency in generated videos
Enhances visual quality across various DiT models
Applicable without retraining or fine-tuning
Abstract
DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.
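The core idea in the abstract, measuring how much temporal attention falls off the diagonal and using that to strengthen cross-frame correlations, can be illustrated with a minimal NumPy sketch. The function names, the `temperature` parameter, and especially the scaling rule inside `enhance` are assumptions made for illustration; the text above does not specify the exact formula, so this should not be read as the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def temporal_attention(q, k):
    """Attention map across frames; q and k have shape (frames, dim)."""
    dim = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(dim), axis=-1)


def off_diagonal_mean(attn):
    """Mean of the non-diagonal entries, i.e. attention paid to other frames."""
    frames = attn.shape[0]
    mask = ~np.eye(frames, dtype=bool)  # True everywhere except the diagonal
    return float(attn[mask].mean())


def enhance(output, attn, temperature=1.0):
    """Hypothetical rule (an assumption): rescale the output, never below 1x."""
    frames = attn.shape[0]
    scale = max(1.0, (frames + temperature) * off_diagonal_mean(attn))
    return scale * output, scale


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # 4 frames, 8-dimensional queries
k = rng.normal(size=(4, 8))
attn = temporal_attention(q, k)
enhanced, scale = enhance(np.ones((4, 8)), attn, temperature=1.0)
print(f"off-diagonal mean: {off_diagonal_mean(attn):.3f}, scale: {scale:.3f}")
```

Under this assumed rule, an attention map concentrated on the diagonal leaves the output unchanged, while a map that spreads attention across frames scales it up. Because the step only reads an attention map that is already computed, it needs no retraining, which is consistent with the training-free claim.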
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
- The paper presents convincing visualizations and clearly justifies its motivation. The idea of using non-diagonal temporal attention with temperature for better temporal consistency seems to be novel for video generation tasks. The module design is simple and intuitive.
- The paper conducted a user study to validate the effectiveness of the proposed method, and extensive experiments demonstrate its effectiveness.
- the proposed method achieves reasonable performance improvemen
- The paper adopts VBench for quantitative comparisons. However, the reliability of the VBench metrics is still not well justified. Based on the reviewer's experience, some scores might favor specific aspects of videos while ignoring the actual visual quality. Some of the reported improvements are quite marginal.
- The number of samples used in the user study seems to be limited, and how those samples were selected is not mentioned (randomly selected or not?).
- the experimental settings of
1. This paper is well-structured and easy to follow, making it accessible for readers at different levels of expertise.
2. The proposed Enhance-A-Video is a training-free, plug-and-play method that is easy to integrate with existing DiT-based T2V models, including SOTA models like WAN and HunyuanVideo.
3. The authors have conducted extensive experiments to demonstrate the effectiveness of Enhance-A-Video.
1. **The visual improvement is not significant**. The visual comparisons in Fig. 1, Fig. 6 (b) and Fig. 8 (left) do not show a significant performance gain over the original results. Moreover, I suspect that the result in Fig. 1 is cherry-picked, since current T2V models possess limited capability (or domain knowledge) in generating coherent limb features, but the authors claimed that their method can tackle these issues in a training-free manner without modifyi
1. The proposed Enhance Block is conceptually clean, allowing it to be easily integrated into temporal attention modules in existing DiT-based video generation models without re-training or fine-tuning.
2. The proposed method improves the performance of the existing models with <3% runtime increase on A100.
3. Enhance-A-Video demonstrates its superiority across 3D-full and spatial-temporal attention video generation models (Wan2.1, HunyuanVideo, CogVideoX, LTX-Video, Open-Sora), suggesting decen
1. The 110-participant preference study is promising, yet drawn from only 15 prompts with an uneven per-model sample count. More prompts and counterbalancing would strengthen claims.
2. Why can’t model training learn the calibrated attention pattern? Your observation is that learned temporal attention concentrates on diagonals, under-exploiting cross-frame cues. Couldn’t one add a regularizer encouraging non-diagonals so that the model learns similar CFI-like balancing during training?
3. The te
1. The method is training-free and functions as a plug-and-play module. This allows it to be integrated into various existing, pre-trained DiT-based video generation models without any costly retraining or fine-tuning.
2. The Enhance Block introduces negligible computational overhead during inference.
1. **Conceptual Contradiction of Temperature.** A primary weakness is the conceptual contradiction in the definition and application of temperature (tau).
   - In Section 3.2, the paper introduces 'temperature' in its classical sense: a parameter applied inside the softmax function to modulate the probabilistic distribution of attention weights without altering the feature scale (Eq. (4)).
   - However, the proposed enhance temperature in Section 3.3 is used in a completely different mechanism. As show
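The classical sense of temperature that this review refers to can be sketched in a few lines. The snippet assumes the common convention of dividing the logits by tau before the softmax; the paper's Eq. (4) is not reproduced here, so the exact form is an assumption, and `softmax_with_temperature` is a hypothetical helper name.

```python
import numpy as np


def softmax_with_temperature(logits, tau=1.0):
    """Softmax over the last axis with logits divided by tau (assumed convention)."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum(axis=-1, keepdims=True)


logits = np.array([2.0, 1.0, 0.0])
sharp = softmax_with_temperature(logits, tau=0.5)  # lower tau: more peaked weights
flat = softmax_with_temperature(logits, tau=2.0)   # higher tau: more uniform weights
print(sharp, flat)
```

In both cases the weights still sum to 1, so temperature applied inside the softmax redistributes attention without changing the overall scale of the attended features. That property is what the reviewer contrasts with a factor applied outside the softmax.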
Taxonomy
Topics: Image and Video Quality Assessment
Methods: Softmax · Attention Is All You Need
