Intrinsic Entropy of Context Length Scaling in LLMs
Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Lei Li

TL;DR
This paper introduces the concept of intrinsic entropy to analyze how context length affects language model performance, providing theoretical insights and experimental validation on natural and synthetic data.
Contribution
It proposes a novel intrinsic entropy framework to explain context length effects and offers practical guidelines for optimal context length based on dataset size.
Findings
Training dataset size influences optimal context length.
Theoretical bounds on context length scaling are established.
Experimental validation supports the proposed framework.
Abstract
Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context could harm performance, while some experimentally summarize loss reduction by relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose to use 'Intrinsic Entropy' for explaining the impact of context length on language modeling; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases. We hope our work may inspire new…
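To make the claimed trade-off concrete, here is a minimal toy sketch, not the paper's actual formulation. It assumes the loss splits into a Bayes Risk term that shrinks as context length grows and an Approximation Error term that grows with context length and shrinks with training dataset size. The power-law forms, every constant, and the names `toy_loss` and `best_context_length` are hypothetical choices made only for illustration.

```python
def toy_loss(context_len, dataset_size,
             h_inf=2.0, a=1.0, alpha=0.5,    # hypothetical Bayes Risk constants
             b=50.0, gamma=0.5, beta=0.5):   # hypothetical Approximation Error constants
    """Toy loss = Bayes Risk + Approximation Error (assumed forms, not from the paper)."""
    # Assumed: longer context lowers the irreducible loss toward h_inf.
    bayes_risk = h_inf + a * context_len ** (-alpha)
    # Assumed: longer context is harder to fit, and more data reduces that error.
    approx_error = b * context_len ** gamma / dataset_size ** beta
    return bayes_risk + approx_error


def best_context_length(dataset_size, candidates):
    """Return the candidate context length with the lowest toy loss."""
    return min(candidates, key=lambda length: toy_loss(length, dataset_size))


# Candidate context lengths: powers of two from 16 to 16384.
candidates = [2 ** k for k in range(4, 15)]

for dataset_size in (1e6, 1e8, 1e10):
    best = best_context_length(dataset_size, candidates)
    print(f"dataset size {dataset_size:.0e}: best context length in grid = {best}")
```

Under these assumed forms, the best context length in the grid grows with dataset size, which mirrors the qualitative claim that training dataset size dictates an optimal context length. The specific numbers carry no meaning beyond the toy model.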
Peer Reviews
Decision: ICLR 2026 Oral
1. While the decomposition of loss into Bayes Risk and Approximation Error is standard, and the use of an "intrinsic space" is inspired by prior work, the formulation of "Intrinsic Entropy" as a central concept to bridge context length and loss is novel. 2. The idea that there is a fundamental trade-off leading to an optimal context length is a significant and non-obvious insight that challenges the simplistic view that "more context is always better." 3. The paper is generally well-structured
My main concern is that existing work [1] has pointed out that perplexity (PPL) cannot serve as a standard for evaluating long-text performance. Therefore, experimental results on more diverse long-text benchmarks, such as RULER, HELMET, and LongBench v2, are necessary. [1] Fang L, Wang Y, Liu Z, et al. What is Wrong with Perplexity for Long-context Language Modeling? arXiv preprint arXiv:2410.23771, 2024.
The direction of constructing a decomposition and showing that an optimal context length exists, as done in Section 3, is very interesting. Eqs. 6 and 7 effectively show that D implies an optimal L. Several empirical results demonstrate this. In addition, the linear relationship between cross-entropy and intrinsic entropy seems novel and insightful.
I personally think that the mathematical construction lacks rigor. For example, there is no justification for equating states with volume divided by dimensions. Although Eq. (3) is understandable, it is not an acceptable exposition since l is discrete. It is also problematic to treat cross-entropy directly as Bayes Risk, particularly since no justification is provided anywhere in the paper. The paper is badly presented. The figures are not very readable most of the time, with non-informative titles (e
The paper presents an interesting result, consisting of a non-trivial observation and a sound and consistent theory. The authors provide a convincing set of evidence that supports their theoretical assumptions.
The paper claims that the theoretical framework presented can provide practical insights, such as establishing that the size of the training dataset dictates an optimal context length and bounds context-length scaling. It seems to miss that, in practice, the context length is determined not to minimize the loss function but rather for other considerations, such as the typical context length in the target use cases and the capabilities of the computational hardware. If the prescribed “optimal” co
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
