Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression

Cheng Yuan; Jiawei Shao; Xuelong Li

arXiv:2511.08066 · cs.AI · March 11, 2026

Open Access · 1 dataset

TL;DR

This paper introduces 'information capacity', a metric for evaluating the inference efficiency of large language models based on text compression. The metric accounts for tokenizer efficiency, correlates with model performance, and exposes linguistic biases in mainstream models.

Contribution

The paper proposes a novel efficiency metric for LLMs that incorporates tokenizer efficiency and correlates with model performance, aiding future model scaling and optimization.

Findings

01  Information capacity is consistent across models of different sizes within a series.

02  Strong linguistic biases are observed in mainstream LLMs.

03  Performance can be accurately predicted from information capacity.

Abstract

Recent years have witnessed the rapid advancements of large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further intensifies the tension between model capability and resource consumption. However, a rigorous metric that accurately reflects an LLM's inference efficiency across diverse tokenizers, parameter counts, and model architectures remains absent. Motivated by the correlation between compression and intelligence, we introduce information capacity, a measure of model efficiency based on text compression performance relative to computational complexity. A distinctive feature of information capacity is its incorporation of tokenizer efficiency, which affects inference costs but is often neglected in LLM evaluations. We assess the information capacity of 56 open-source…
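
The abstract describes the metric in words only, and this excerpt does not give its formula. As a rough illustration of the compression side alone, the sketch below converts per-token log-probabilities into an ideal code length (the length an arithmetic coder driven by the model would approach) and normalizes by the UTF-8 byte count of the text rather than by token count. The function names and the toy probabilities are hypothetical, and the normalization by computational complexity that the paper applies is not reproduced here.

```python
import math


def compressed_bits(token_logprobs):
    """Ideal code length in bits: the negative sum of log2 p(token).

    token_logprobs holds natural-log probabilities, one per token.
    """
    return -sum(lp / math.log(2) for lp in token_logprobs)


def bits_per_byte(text, token_logprobs):
    """Code length normalized by the UTF-8 size of the raw text.

    Dividing by bytes instead of tokens makes models with
    different tokenizers comparable on the same text.
    """
    return compressed_bits(token_logprobs) / len(text.encode("utf-8"))


def compression_ratio(text, token_logprobs):
    """Raw size in bits divided by the ideal compressed size in bits."""
    return 8 * len(text.encode("utf-8")) / compressed_bits(token_logprobs)


# Toy example (made-up probabilities, not measured from any model).
text = "hello world"  # 11 bytes in UTF-8

# Hypothetical tokenizer A: 2 tokens, each predicted with p = 1/16.
logprobs_a = [math.log(1 / 16)] * 2
# Hypothetical tokenizer B: 4 tokens, each predicted with p = 1/4.
logprobs_b = [math.log(1 / 4)] * 4

bits_a = compressed_bits(logprobs_a)
bits_b = compressed_bits(logprobs_b)

# Per-token loss differs (4 bits vs 2 bits per token) ...
per_token_a = bits_a / len(logprobs_a)
per_token_b = bits_b / len(logprobs_b)

# ... but the cost per byte of text is the same for both.
bpb_a = bits_per_byte(text, logprobs_a)
bpb_b = bits_per_byte(text, logprobs_b)
```

In this toy case the per-token loss makes tokenizer B look twice as good, while the per-byte figure shows the two encode the text at identical cost. That gap is one reason a tokenizer-aware measure can differ from a plain per-token loss comparison.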

Datasets

TeleAI-AI-Flow/InformationCapacity
dataset · 429 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy