Rethinking Perplexity: Revealing the Impact of Input Length on Perplexity Evaluation in LLMs

Letian Cheng; Junyan Wang; Yan Gao; Elliott Wen; Ting Dang; Hong Jia

arXiv:2602.04099·cs.LG·February 5, 2026

Open Access

TL;DR

This paper introduces LengthBenchmark, a system-conscious evaluation framework for studying how input length affects perplexity and other metrics in large language models. It exposes length-related biases in evaluation and draws out their implications for deployment.

Contribution

The paper presents an evaluation framework that treats input length, evaluation protocol design, and system-level costs as explicit variables, giving a more realistic assessment of LLM performance.

Findings

01

Sliding window evaluation inflates short input performance.

02

Model gains increase with longer input segments.

03

Length bias affects fair comparison across models.
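
To illustrate how the choice of evaluation protocol alone can shift perplexity, here is a minimal sketch comparing disjoint chunks with an overlapping sliding window. It is not the paper's implementation: `toy_logprob` is a hypothetical stand-in for a model whose per-token probability improves with the amount of preceding context, and the window and stride values are arbitrary.

```python
import math

def toy_logprob(context_len):
    # Hypothetical stand-in for a real model (an assumption, not from the paper):
    # probability grows with available context and is capped at 0.9.
    return math.log(min(0.9, 0.1 + 0.05 * context_len))

def windowed_ppl(n_tokens, window, stride):
    """Perplexity when each token is scored once, inside a window of fixed size.

    stride == window gives disjoint chunks; stride < window gives a sliding
    window in which only the tokens not yet scored are counted.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] so every token is scored")
    total, prev_end = 0.0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        for pos in range(prev_end, end):
            # Context visible to this token is limited to the current window.
            total += toy_logprob(pos - begin)
        prev_end = end
        if end == n_tokens:
            break
    return math.exp(-total / n_tokens)

disjoint = windowed_ppl(64, window=16, stride=16)
sliding = windowed_ppl(64, window=16, stride=4)
print(f"disjoint chunks: {disjoint:.3f}, sliding window: {sliding:.3f}")
```

Under this toy scorer the sliding-window protocol reports a lower perplexity than disjoint chunks on the same 64 tokens, because most tokens are scored with nearly a full window of context. This is one plausible mechanism by which a protocol can flatter a model, and it shows why the protocol should be reported alongside the number.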

Abstract

Perplexity is a widely adopted metric for assessing the predictive quality of large language models (LLMs) and often serves as a reference metric for downstream evaluations. However, recent evidence shows that perplexity can be unreliable, especially when irrelevant long inputs are used, raising concerns for both benchmarking and system deployment. While prior efforts have employed selective input filtering and curated datasets, the impact of input length on perplexity has not been systematically studied from a systems perspective, and input length has rarely been treated as a first-class system variable affecting both fairness and efficiency. In this work, we close this gap by introducing LengthBenchmark, a system-conscious evaluation framework that explicitly integrates input length, evaluation protocol design, and system-level costs, evaluating representative LLMs under two scoring…
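
For reference, perplexity is conventionally defined as the exponential of the negative mean log-probability per token. The sketch below computes it from per-token natural-log probabilities. The probability values are made up for illustration and do not come from the paper. They only show that the same formula can give very different numbers for a short and a long input.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities (illustrative values, not measured results):
# the first 4 tokens are hard to predict, the following 12 are easier.
short_input = [math.log(0.25)] * 4
long_input = [math.log(0.25)] * 4 + [math.log(0.5)] * 12

ppl_short = perplexity(short_input)
ppl_long = perplexity(long_input)
print(f"short input: {ppl_short:.3f}, long input: {ppl_long:.3f}")
```

Because the score is an average over tokens, the number of tokens in the input determines how much weight the poorly predicted early positions carry.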

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software System Performance and Reliability