Enhance-A-Video: Better Generated Video for Free

Yang Luo; Xuanlei Zhao; Mengzhao Chen; Kaipeng Zhang; Wenqi Shao; Kai Wang; Zhangyang Wang; Yang You

arXiv:2502.07508·cs.CV·February 28, 2025

TL;DR

Enhance-A-Video is a training-free method that improves the temporal consistency and visual quality of DiT-based generated videos by enhancing cross-frame correlations. It can be applied to existing models without retraining or fine-tuning.

Contribution

We introduce a simple, training-free enhancement technique for DiT-based video generation that boosts temporal coherence and visual quality without retraining.

Findings

01. Improves temporal consistency in generated videos

02. Enhances visual quality across various DiT models

03. Applicable without retraining or fine-tuning

Abstract

DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.
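
As a rough illustration of the statistic the abstract refers to, the sketch below builds a small frames-by-frames temporal attention map and averages its non-diagonal entries. It is a minimal pure-Python sketch, not the authors' implementation: the function names (`temporal_attention`, `cross_frame_intensity`), the toy per-frame vectors, and the use of plain scaled dot-product attention are assumptions made for illustration, and how the resulting value is used to enhance the model's output is not covered here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(queries, keys):
    """Row-wise softmax of scaled dot products between per-frame vectors.

    queries and keys are lists with one feature vector per frame
    (hypothetical toy inputs, not real model activations).
    """
    d = len(queries[0])
    scores = [
        [sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in keys]
        for qi in queries
    ]
    return [softmax(row) for row in scores]


def cross_frame_intensity(attn):
    """Mean of the non-diagonal entries of a square frames-by-frames map."""
    n = len(attn)
    off_diag = [attn[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(off_diag) / len(off_diag)


# Toy example: three frames, two features per frame (assumed values).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attn = temporal_attention(frames, frames)
cfi = cross_frame_intensity(attn)
```

In this sketch, a map that attends only to the same frame gives a value of 0, and a uniform map over n frames gives 1/n, so larger values indicate more attention mass placed on other frames.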

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 8 · Confidence 5

Strengths

- The paper presents convincing visualizations and clearly justifies its motivation. The idea of using non-diagonal temporal attention with temperature for better temporal consistency seems novel for video generation tasks. The module design is simple and intuitive.
- The paper conducted a user study to validate the effectiveness of the proposed method, and extensive experiments demonstrate its effectiveness.
- the proposed method achieves reasonable performance improvemen

Weaknesses

- The paper adopts VBench for quantitative comparisons. However, the reliability of the VBench metrics is still not well justified. Based on the reviewer's experience, some scores might favor specific aspects of videos while ignoring the actual visual quality. Some of the reported improvements are quite marginal.
- The number of samples used in the user study seems limited, and how those samples were selected is not mentioned (randomly selected or not?).
- the experimental settings of

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. This paper is well-structured and easy to follow, making it accessible for readers at different levels of expertise.
2. The proposed Enhance-A-Video is a training-free, plug-and-play method that is easy to integrate with existing DiT-based T2V models, including SOTA models like WAN and HunyuanVideo.
3. The authors have conducted extensive experiments to demonstrate the effectiveness of Enhance-A-Video.

Weaknesses

1. **The visual improvement is not significant**. The visual comparisons in Fig. 1, Fig. 6 (b) and Fig. 8 (left) do not show a significant performance gain over the original results. Moreover, I suspect that the results in Fig. 1 are cherry-picked, since current T2V models possess limited capability (or domain knowledge) in generating coherent limb features, but the authors claimed that their method can tackle these issues in a training-free manner without modifyi

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The proposed Enhance Block is conceptually clean, allowing it to be easily integrated into temporal attention modules in existing DiT-based video generation models without re-training or fine-tuning.
2. The proposed method improves the performance of the existing models with <3% runtime increase on A100.
3. Enhance-A-Video demonstrates its superiority across 3D-full and spatial-temporal attention video generation models (Wan2.1, HunyuanVideo, CogVideoX, LTX-Video, Open-Sora), suggesting decen

Weaknesses

1. The 110-participant preference study is promising, yet drawn from only 15 prompts with an uneven per-model sample count. More prompts and counterbalancing would strengthen claims.
2. Why can’t model training learn the calibrated attention pattern? Your observation is that learned temporal attention concentrates on diagonals, under-exploiting cross-frame cues. Couldn’t one add a regularizer encouraging non-diagonals so that the model learns similar CFI-like balancing during training?
3. The te

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. The method is training-free and functions as a plug-and-play module. This allows it to be integrated into various existing, pre-trained DiT-based video generation models without any costly retraining or fine-tuning.
2. The Enhance Block introduces negligible computational overhead during inference.

Weaknesses

1. Conceptual Contradiction of Temperature. A primary weakness is the conceptual contradiction in the definition and application of temperature (tau).
   - In Section 3.2, the paper introduces 'temperature' in its classical sense: a parameter applied inside the softmax function to modulate the probabilistic distribution of attention weights without altering the feature scale (Eq. (4)).
   - However, the proposed enhance temperature in Section 3.3 is used in a completely different mechanism. As show
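
The classical temperature that this review refers to can be sketched as follows. This is a minimal pure-Python illustration under the assumption that the temperature divides the logits before the softmax; the paper's Eq. (4) may place it differently, and `softmax_with_temperature` is a hypothetical helper name. It shows only that the temperature reshapes the distribution while the weights still sum to 1. The enhance temperature of Section 3.3 is not modeled.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax over logits / tau (assumed convention; tau must be positive)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy attention logits for one query over three frames (assumed values).
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # lower tau: more peaked
base = softmax_with_temperature(logits, 1.0)   # standard softmax
flat = softmax_with_temperature(logits, 2.0)   # higher tau: more uniform
```

Because every output is still a normalized distribution, a weighted sum of value vectors under any of these settings remains a convex combination of the values, which is the sense in which the feature scale is left unchanged.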

Code & Models

Repositories

NUS-HPC-AI-Lab/Enhance-A-Video
PyTorch · Official

Taxonomy

Topics: Image and Video Quality Assessment

Methods: Softmax · Attention Is All You Need