What is Wrong with Perplexity for Long-context Language Modeling?
Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang

TL;DR
This paper explains why perplexity (PPL) is unreliable for evaluating long-context language models, introduces LongPPL, a metric that focuses on key tokens, and proposes LongCE, a training objective for fine-tuning that improves long-context performance.
Contribution
The paper reveals why perplexity fails for long contexts, proposes LongPPL to focus on key tokens, and introduces LongCE for better model training.
Findings
LongPPL correlates strongly with long-context performance (Pearson -0.96)
Traditional PPL overlooks key tokens, misrepresenting model capabilities
LongCE improves model performance across benchmarks
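To make the correlation figure in the first finding concrete, here is a minimal sketch of how a Pearson coefficient between a metric and a benchmark score is computed. The numbers are invented for illustration and are not results from the paper. A value near -1 means that a lower metric value goes with a higher benchmark score.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical values: one (metric, benchmark score) pair per model.
metric_values = [2.0, 2.5, 3.0, 3.5]
benchmark_scores = [80.0, 70.0, 60.0, 50.0]

r = pearson(metric_values, benchmark_scores)
print(f"Pearson r = {r:.2f}")
```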
Abstract
Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that…
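The idea in the abstract can be sketched as follows, under stated assumptions. A token is treated as "key" when an evaluator model assigns it a much higher log-probability with the long context than with a truncated short context. Perplexity is then averaged over those tokens only. The function names, the threshold `alpha`, and all numbers below are hypothetical choices for illustration, not the paper's exact criterion or data.

```python
import math

def key_token_mask(logp_long, logp_short, alpha=2.0):
    """Flag tokens whose log-probability rises by more than `alpha`
    when the long context is available (assumed selection rule)."""
    return [(l - s) > alpha for l, s in zip(logp_long, logp_short)]

def perplexity(logp, mask=None):
    """exp of the negative mean log-probability over the selected tokens."""
    if mask is None:
        mask = [True] * len(logp)
    selected = [lp for lp, keep in zip(logp, mask) if keep]
    if not selected:
        raise ValueError("no tokens selected")
    return math.exp(-sum(selected) / len(selected))

# Hypothetical per-token natural-log probabilities, invented for illustration.
evaluator_long = [-0.1, -0.2, -0.5, -0.1, -0.3]   # evaluator, full context
evaluator_short = [-0.2, -0.3, -4.0, -0.1, -5.0]  # evaluator, truncated context
model_logp = [-0.1, -0.1, -3.0, -0.1, -2.5]       # model under evaluation

mask = key_token_mask(evaluator_long, evaluator_short)
ppl = perplexity(model_logp)             # averages over all tokens
long_ppl = perplexity(model_logp, mask)  # averages over key tokens only
print(mask, round(ppl, 2), round(long_ppl, 2))
```

In this toy example the model does poorly on the two context-dependent tokens, yet plain perplexity stays low because the three easy tokens dominate the average. Restricting the average to key tokens exposes the gap.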
Peer Reviews
Decision·ICLR 2025 Poster
The idea is simple and seems to work well.
n/a
This paper conducts a fine-grained analysis to investigate why perplexity (PPL) fails to capture long-context capabilities in LLMs. Few previous studies focus on re-weighting tokens to improve long-context performance; this work addresses this directly by re-weighting tokens based on their dependence on long-context information, providing a more fundamental and targeted solution.
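The re-weighting described in this review can be illustrated with a minimal sketch, under stated assumptions. Each token's loss is scaled by a weight that grows with how much the long context raises its probability, capped at `gamma`. The weight formula, the cap, the normalization by the weight sum, and the numbers are all assumptions made here for illustration and may differ from the paper's LongCE definition.

```python
import math

def reweighted_cross_entropy(logp_long, logp_short, gamma=5.0):
    """Weighted average of per-token negative log-likelihoods.

    Weights are min(exp(logp_long - logp_short), gamma), so tokens that
    depend on the long context count more (assumed weighting scheme).
    """
    weights = [min(math.exp(l - s), gamma) for l, s in zip(logp_long, logp_short)]
    weighted_nll = sum(w * -l for w, l in zip(weights, logp_long))
    return weighted_nll / sum(weights), weights

# Hypothetical per-token natural-log probabilities, invented for illustration.
logp_long = [-0.1, -0.5, -0.3]   # with the full context
logp_short = [-0.1, -4.0, -0.4]  # with a truncated context

loss, weights = reweighted_cross_entropy(logp_long, logp_short)
plain_loss = -sum(logp_long) / len(logp_long)  # standard cross-entropy
print(weights, round(loss, 4), round(plain_loss, 4))
```

Here the second token, which benefits most from the long context, hits the cap and dominates the loss, whereas standard cross-entropy treats all three tokens equally.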
Some settings in the figures lack clear explanations, which affects the reliability of the figures and the conclusions drawn from them. For example, in Figure 2, the specific prompt lengths for each data point are not provided, making it difficult to interpret the statement, "Each point represents the results obtained from testing at a specific prompt length." Although the correlation arguments seem promising and qualitatively sound, they would benefit from added statistical rigor, such as p-val
- The introduction of LongPPL and LongCE represents an innovation in the evaluation and fine-tuning of LLMs for long-context tasks. These metrics address the limitations of traditional perplexity and Cross Entropy loss, providing more accurate and task-relevant assessments of model performance.
- The concepts presented by the authors are easy to understand, including the motivation and method. The figures greatly help in understanding the concepts and the experiments.
- The LongPPL metric relies on a relatively strong model (medium-sized Qwen2-72B-Instruct) as an evaluator to identify key tokens. This dependence may limit the applicability of the metric in scenarios where such a strong model is not available or practical to use.
- The additional computational cost associated with identifying key tokens may still be significant. This could be a barrier to adoption in resource-constrained environments.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
