What is Wrong with Perplexity for Long-context Language Modeling?

Lizhe Fang; Yifei Wang; Zhaoyang Liu; Chenheng Zhang; Stefanie Jegelka; Jinyang Gao; Bolin Ding; Yisen Wang

arXiv:2410.23771·cs.CL·July 29, 2025·2 cites

TL;DR

This paper identifies why perplexity is a poor measure of long-context capability, introduces LongPPL as a better evaluation metric, and proposes LongCE as a fine-tuning loss, yielding more accurate assessment and improved long-context performance.

Contribution

The paper reveals why perplexity fails for long contexts, proposes LongPPL to focus on key tokens, and introduces LongCE for better model training.

Findings

01

LongPPL correlates strongly with long-context performance (Pearson -0.96)

02

Traditional PPL overlooks key tokens, misrepresenting model capabilities

03

LongCE improves model performance across benchmarks
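
Finding 03 concerns LongCE, which Reviewer 02 describes as re-weighting tokens by their dependence on long-context information. A minimal sketch of such a re-weighted cross-entropy follows. The clipped exponential weight, the `max_weight` cap, and the normalization by the weight sum are illustrative assumptions here, not details confirmed by this page, and `gains` stands for a hypothetical per-token log-probability improvement from seeing the long context.

```python
import math

def reweighted_cross_entropy(logprobs, gains, max_weight=5.0):
    """Weighted mean of per-token negative log-likelihoods.

    logprobs: log P(token | context) for each target token.
    gains: hypothetical per-token long-context gain (log-prob with long
        context minus log-prob with short context).
    max_weight: assumed cap that keeps a single token from dominating.
    """
    # Tokens that benefit more from long context receive larger weights.
    weights = [min(math.exp(g), max_weight) for g in gains]
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / total

token_logprobs = [-0.1, -0.1, -2.6, -0.1]
uniform_loss = reweighted_cross_entropy(token_logprobs, [0.0, 0.0, 0.0, 0.0])
focused_loss = reweighted_cross_entropy(token_logprobs, [0.0, 0.0, 3.5, 0.0])
print(uniform_loss, focused_loss)
```

With all gains at zero, the weights are uniform and the loss reduces to ordinary mean cross-entropy. Raising the gain on the third token shifts the loss toward that token, which is the behavior the re-weighting is meant to produce.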

Abstract

Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that…
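
The averaging problem described in the abstract can be illustrated with a toy calculation. The sketch below compares ordinary perplexity with a perplexity restricted to key tokens. Tokens are marked as key when an evaluator's log-probability improves by at least a threshold once the long context is visible. The function names, the `min_gain` threshold, and all numbers are hypothetical, and the selection rule is a simplified stand-in for the paper's long-short contrastive method rather than its exact definition.

```python
import math

def perplexity(logprobs):
    """Exponential of the mean negative log-likelihood over all tokens."""
    return math.exp(-sum(logprobs) / len(logprobs))

def key_token_mask(long_lp, short_lp, min_gain=2.0):
    """Mark tokens whose log-prob rises by at least min_gain with long context."""
    return [(l - s) >= min_gain for l, s in zip(long_lp, short_lp)]

def key_token_perplexity(logprobs, mask):
    """Perplexity computed only over the tokens selected by mask."""
    selected = [lp for lp, keep in zip(logprobs, mask) if keep]
    if not selected:
        raise ValueError("no key tokens selected")
    return math.exp(-sum(selected) / len(selected))

# Toy evaluator log-probs for four tokens, with long and short context.
evaluator_long = [-0.1, -0.2, -0.5, -0.1]
evaluator_short = [-0.1, -0.3, -4.0, -0.2]
mask = key_token_mask(evaluator_long, evaluator_short)

# Two hypothetical models with the same total log-likelihood.
model_a = [-0.9, -0.9, -0.2, -0.9]  # good on the key token, mediocre elsewhere
model_b = [-0.1, -0.1, -2.6, -0.1]  # good elsewhere, poor on the key token

print(perplexity(model_a), perplexity(model_b))
print(key_token_perplexity(model_a, mask), key_token_perplexity(model_b, mask))
```

Both toy models have the same perplexity over all four tokens, yet they differ sharply on the single token that depends on long context. Restricting the average to key tokens exposes that difference.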

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The idea is simple and seems to work well.

Weaknesses

n/a

Reviewer 02 · Rating 6 · Confidence 4

Strengths

This paper conducts a fine-grained analysis to investigate why perplexity (PPL) fails to capture long-context capabilities in LLMs. Few previous studies focus on re-weighting tokens to improve long-context performance; this work addresses this directly by re-weighting tokens based on their dependence on long-context information, providing a more fundamental and targeted solution.

Weaknesses

Some settings in the figures lack clear explanations, which affects the reliability of the figures and the conclusions drawn from them. For example, in Figure 2, the specific prompt lengths for each data point are not provided, making it difficult to interpret the statement, "Each point represents the results obtained from testing at a specific prompt length." Although the correlation arguments seem promising and qualitatively sound, they would benefit from added statistical rigor, such as p-val

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- The introduction of LongPPL and LongCE represents an innovation in the evaluation and fine-tuning of LLMs for long-context tasks. These metrics address the limitations of traditional perplexity and Cross Entropy loss, providing more accurate and task-relevant assessments of model performance.
- The concepts presented by the authors are easy to understand, including the motivation and method. The figures greatly help in understanding the concepts and the experiments.

Weaknesses

- LongPPL metric relies on a relatively strong model (medium-sized Qwen2-72B-Instruct) as an evaluator to identify key tokens. This dependence may limit the applicability of the metric in scenarios where such a strong model is not available or practical to use.
- The additional computational cost associated with identifying key tokens may still be significant. This could be a barrier to adoption in resource-constrained environments.

Code & Models

Repositories

pku-ml/longppl · PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling