Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation
Yizhao Han, Tianxing Shi, Zhao Wang, Zifan Xu, Zhiyuan Pu, Mingxiao Li, Qian Zhang, Wei Yin, Xiao-Xiao Long

TL;DR
This paper introduces ENkG, an entropy-guided sampling method for long-horizon autoregressive video generation that adaptively adjusts token candidate sizes based on predicted distribution entropy, improving quality and stability.
Contribution
The paper proposes a novel entropy-guided sampling strategy for video generation that dynamically adapts to token uncertainty, enhancing long-term quality without retraining.
Findings
Improved perceptual quality over static sampling methods.
Enhanced structural stability in generated videos.
Model-agnostic and training-free approach.
Abstract
Autoregressive (AR) architectures have achieved significant successes in LLMs, inspiring explorations for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed size of token candidates already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness for low-uncertainty regions (static backgrounds) or get stuck in early errors for high-uncertainty regions (foreground objects). Prediction errors will accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a…
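The idea of adapting the candidate set to the predicted distribution's entropy can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the linear mapping from entropy to nucleus mass, the function names, and the default values of `p_low`, `p_high`, `h_low`, `h_high`, and `k_g` are illustrative placeholders (the symbols loosely mirror the hyperparameters named in the reviews). The sketch keeps few candidates when entropy is low, more when it is high, and never fewer than `k_g`.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_top_p(h, p_low, p_high, h_low, h_high):
    """Map entropy h to a nucleus mass by linear interpolation (assumed mapping)."""
    if h <= h_low:
        return p_low
    if h >= h_high:
        return p_high
    t = (h - h_low) / (h_high - h_low)
    return p_low + t * (p_high - p_low)


def enkg_candidates(probs, p_low=0.5, p_high=0.95, h_low=1.0, h_high=4.0, k_g=2):
    """Return the token indices kept after entropy-adaptive truncation.

    All default values are hypothetical placeholders, not values from the paper.
    """
    p = adaptive_top_p(entropy(probs), p_low, p_high, h_low, h_high)
    # Rank tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # k-guard (assumed form): always keep at least k_g candidates.
    if len(kept) < k_g:
        kept = order[:k_g]
    return kept


def sample(probs, rng=random, **kwargs):
    """Draw one token index from the truncated distribution."""
    kept = enkg_candidates(probs, **kwargs)
    weights = [probs[i] for i in kept]  # choices() renormalizes relative weights
    return rng.choices(kept, weights=weights, k=1)[0]
```

Under these assumptions, a sharply peaked distribution (for example a static background token) collapses to the guard size `k_g`, while a near-uniform distribution keeps most of the vocabulary.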
Peer Reviews
Decision · ICLR 2026 Conference Desk Rejected Submission
1. The paper is well-written. The proposed method is well-illustrated and easy to follow. 2. **Clear motivation and insightful problem analysis:** The paper provides a very clear and compelling motivation for the work. The analysis of the fundamental differences between language and video tokens, the connection between token entropy and the semantic structure of the image, and the identification of the "entropy collapse" phenomenon are insightful and effectively frame the problem. 3. **The sol
1. **The similar entropy-based method for AR sampling strategies has been explored by previous work** [1], which limits the novelty of this work. [1] Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy. 2. **Lack of hyperparameter sensitivity analysis:** The method introduces a set of new hyperparameters, including `plow`, `phigh`, `Hlow`, `Hhigh`, and `kg`. The paper reports the values used but does not provide any analysis of the method's sensitivity to
The paper's primary strength is its clear and insightful diagnosis of the problem, effectively distinguishing the statistical properties of video versus language tokens and identifying the "entropy collapse" phenomenon. The proposed ENkG method is simple and highly practical, as it can be applied as a plug-and-play module to existing models without any retraining. The experimental results are convincing, showing consistent and significant improvements across multiple state-of-the-art video model
The main weakness lies in the potentially limited novelty of the core concepts. While their application to video generation is new and insightful, entropy-guided adaptation and hybrid sampling methods have been explored in other domains. Additionally, the evaluation is heavily focused on autonomous driving scenarios. While effective here, it remains an open question how well this strategy would generalize to more open-domain or creative video generation tasks, which may exhibit different uncerta
1. The theoretical analysis of the method is quite thorough. 2. The comparison metrics are sufficient, but the comparison methods are somewhat lacking.
1. The paper mentions "Long-Horizon Autoregressive Video Generation" but only verifies it with 75-frame data, failing to clarify the effect of ENkG on suppressing entropy collapse for longer sequences (e.g., 100+ frames). It is recommended to: conduct experiments on long sequences of 100-200 frames; supplement entropy change curves under different frame counts (such as averaging entropy statistics every 10 frames); and quantitatively compare the differences in entropy collapse rates between ENkG
1. A Highly Insightful and Principled Diagnosis of AR Failure Modes: The paper's most significant contribution is its profound and novel diagnosis of a key failure mode in AR video generation operating on discrete tokens. It provides a principled explanation rooted in the fundamental mismatch between static sampling strategies and the spatially structured uncertainty of video tokens. The identification of "entropy collapse" is a novel and valuable insight that clarifies a previously poorly under
1. Limited Applicability to Continuous-Space Autoregressive Models: The proposed ENkG method is fundamentally designed for models that operate on a discrete vocabulary of video tokens. While this is a significant class of models, a substantial and growing body of work in autoregressive video generation operates in continuous latent spaces. The core mechanism of ENkG—truncating a categorical distribution—does not directly transfer to these continuous domains. The paper should more explicitly ackn
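The windowed entropy statistic recommended in one of the reviews above (averaging entropy every 10 frames) could be computed along these lines. This is a hedged sketch: the function name `windowed_mean_entropy` and the input format (one mean token entropy per generated frame) are assumptions for illustration, not part of the paper or the review.

```python
def windowed_mean_entropy(frame_entropies, window=10):
    """Average per-frame entropy over consecutive, non-overlapping windows.

    frame_entropies: list of floats, one mean token entropy per frame (assumed input).
    The final window may be shorter than `window` frames.
    """
    means = []
    for start in range(0, len(frame_entropies), window):
        chunk = frame_entropies[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means
```

Plotting the returned values against the window index would give a coarse entropy curve over the rollout. A steady downward trend would be consistent with the "entropy collapse" the reviews discuss, though this sketch does not define a collapse rate.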
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Speech and Audio Processing
