Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation

Yizhao Han; Tianxing Shi; Zhao Wang; Zifan Xu; Zhiyuan Pu; Mingxiao Li; Qian Zhang; Wei Yin; Xiao-Xiao Long

arXiv:2601.19488·cs.CV·February 2, 2026

Open Access·4 Reviews

TL;DR

This paper introduces ENkG, an entropy-guided sampling method for long-horizon autoregressive video generation that adaptively adjusts token candidate sizes based on predicted distribution entropy, improving quality and stability.

Contribution

The paper proposes a novel entropy-guided sampling strategy for video generation that dynamically adapts to token uncertainty, enhancing long-term quality without retraining.

Findings

1. Improved perceptual quality over static sampling methods.
2. Enhanced structural stability in generated videos.
3. Model-agnostic and training-free approach.

Abstract

Autoregressive (AR) architectures have achieved significant successes in LLMs, inspiring explorations for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed size of token candidates already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness for low-uncertainty regions (static backgrounds) or get stuck in early errors for high-uncertainty regions (foreground objects). Prediction errors will accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a…
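The contrast drawn in the abstract can be made concrete with a toy calculation. The sketch below is not from the paper: the 8-token vocabulary, the two distributions, and the helper names `entropy` and `top_k_mass` are all hypothetical, chosen only to show how one fixed k treats a peaked ("static background") and a flat ("foreground object") token distribution.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_k_mass(probs, k):
    """Probability mass retained when only the k most likely tokens are kept."""
    return sum(sorted(probs, reverse=True)[:k])

# Hypothetical 8-token vocabulary, for illustration only.
background = [0.93] + [0.01] * 7   # peaked distribution: low uncertainty
foreground = [0.125] * 8           # flat distribution: high uncertainty

for name, probs in [("background", background), ("foreground", foreground)]:
    print(name,
          "entropy:", round(entropy(probs), 3),
          "mass kept by top-4:", round(top_k_mass(probs, 4), 3))
```

With k = 4, the peaked distribution keeps three near-zero-probability candidates it has no use for, while the flat distribution loses half of its probability mass. A single static candidate size cannot suit both cases.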

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 6·Confidence 4

Strengths

1. The paper is well-written. The proposed method is well-illustrated and easy to follow.
2. **Clear motivation and insightful problem analysis:** The paper provides a very clear and compelling motivation for the work. The analysis of the fundamental differences between language and video tokens, the connection between token entropy and the semantic structure of the image, and the identification of the "entropy collapse" phenomenon are insightful and effectively frame the problem.
3. **The sol

Weaknesses

1. **A similar entropy-based method for AR sampling strategies has been explored by previous work** [1], which limits the novelty of this work.
   [1] Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy.
2. **Lack of hyperparameter sensitivity analysis:** The method introduces a set of new hyperparameters, including `plow`, `phigh`, `Hlow`, `Hhigh`, and `kg`. The paper reports the values used but does not provide any analysis of the method's sensitivity to
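For readers unfamiliar with the hyperparameters listed above, one plausible reading is sketched here. The visible text does not state how they combine, so everything below is an assumption: that the nucleus threshold is interpolated linearly from `plow` to `phigh` as entropy rises from `Hlow` to `Hhigh`, and that `kg` acts as a floor on the number of candidates. The function names and the numeric values in the usage lines are hypothetical, not the paper's settings.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_p(h, p_low, p_high, h_low, h_high):
    """ASSUMED form: linear interpolation of the nucleus threshold in entropy,
    clamped to [p_low, p_high] outside the range [h_low, h_high]."""
    t = (h - h_low) / (h_high - h_low)
    t = min(1.0, max(0.0, t))
    return p_low + t * (p_high - p_low)

def candidates(probs, hp):
    """Smallest set of most-likely token indices whose mass reaches the
    adaptive threshold, never smaller than k_guard (ASSUMED role of `kg`)."""
    p = adaptive_p(entropy(probs), hp["p_low"], hp["p_high"],
                   hp["h_low"], hp["h_high"])
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        if mass >= p and len(kept) >= hp["k_guard"]:
            break
        kept.append(i)
        mass += probs[i]
    return kept

def sample(probs, rng, hp):
    """Draw one token index from the truncated distribution."""
    kept = candidates(probs, hp)
    # random.choices normalizes the weights, so no explicit renormalization is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Illustrative values only; not the settings reported in the paper.
hp = dict(p_low=0.5, p_high=0.95, h_low=0.5, h_high=2.0, k_guard=2)
peaked = [0.93] + [0.01] * 7
flat = [0.125] * 8
print(len(candidates(peaked, hp)), len(candidates(flat, hp)))
print(sample(flat, random.Random(0), hp))
```

Under these assumptions, the peaked distribution is cut down to the guard size while the flat one keeps its whole vocabulary. A sensitivity analysis of the kind requested would sweep the five values and re-measure generation quality.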

Reviewer 02·Rating 6·Confidence 4

Strengths

The paper's primary strength is its clear and insightful diagnosis of the problem, effectively distinguishing the statistical properties of video versus language tokens and identifying the "entropy collapse" phenomenon. The proposed ENkG method is simple and highly practical, as it can be applied as a plug-and-play module to existing models without any retraining. The experimental results are convincing, showing consistent and significant improvements across multiple state-of-the-art video model

Weaknesses

The main weakness lies in the potentially limited novelty of the core concepts. While their application to video generation is new and insightful, entropy-guided adaptation and hybrid sampling methods have been explored in other domains. Additionally, the evaluation is heavily focused on autonomous driving scenarios. While effective here, it remains an open question how well this strategy would generalize to more open-domain or creative video generation tasks, which may exhibit different uncerta

Reviewer 03·Rating 4·Confidence 3

Strengths

1. The theoretical analysis of the method is quite thorough.
2. The comparison metrics are sufficient, but the comparison methods are somewhat lacking.

Weaknesses

1. The paper mentions "Long-Horizon Autoregressive Video Generation" but only verifies it with 75-frame data, failing to clarify the effect of ENkG on suppressing entropy collapse for longer sequences (e.g., 100+ frames). It is recommended to: conduct experiments on long sequences of 100-200 frames; supplement entropy change curves under different frame counts (such as averaging entropy statistics every 10 frames); and quantitatively compare the differences in entropy collapse rates between ENkG
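The entropy curves requested here are straightforward to compute once per-frame mean token entropy has been logged. The sketch below is illustrative: `windowed_mean` and `collapse_rate` are hypothetical helpers, the slope-based definition of a collapse rate is an assumption (the review does not define one), and the input is synthetic rather than data from the paper.

```python
def windowed_mean(values, window=10):
    """Mean of each consecutive block of `window` values.
    The final block may be shorter if len(values) is not a multiple of `window`."""
    means = []
    for start in range(0, len(values), window):
        block = values[start:start + window]
        means.append(sum(block) / len(block))
    return means

def collapse_rate(window_means):
    """ASSUMED metric: average change in mean entropy per window,
    measured between the first and last window (negative = entropy falling)."""
    if len(window_means) < 2:
        return 0.0
    return (window_means[-1] - window_means[0]) / (len(window_means) - 1)

# Synthetic per-frame entropies (NOT data from the paper):
# a steady linear decay over 200 frames.
per_frame_entropy = [4.0 - 0.01 * t for t in range(200)]

means = windowed_mean(per_frame_entropy, window=10)
print(len(means), "windows")
print("entropy change per window:", round(collapse_rate(means), 4))
```

Running the same computation on entropies logged under a static sampler and under ENkG would give the two curves and the rate comparison the reviewer asks for.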

Reviewer 04·Rating 6·Confidence 4

Strengths

1. A Highly Insightful and Principled Diagnosis of AR Failure Modes: The paper's most significant contribution is its profound and novel diagnosis of a key failure mode in AR video generation operating on discrete tokens. It provides a principled explanation rooted in the fundamental mismatch between static sampling strategies and the spatially structured uncertainty of video tokens. The identification of "entropy collapse" is a novel and valuable insight that clarifies a previously poorly under

Weaknesses

1. Limited Applicability to Continuous-Space Autoregressive Models: The proposed ENkG method is fundamentally designed for models that operate on a discrete vocabulary of video tokens. While this is a significant class of models, a substantial and growing body of work in autoregressive video generation operates in continuous latent spaces. The core mechanism of ENkG—truncating a categorical distribution—does not directly transfer to these continuous domains. The paper should more explicitly ackn

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Speech and Audio Processing