Is Compression Really Linear with Code Intelligence?

Shijie Xuyang; Xianzhen Luo; Zheng Chu; Houyi Li; Siming Huang; Qiufeng Wang; Wanxiang Che; Qingfu Zhu; Shuigeng Zhou

arXiv:2505.11441·cs.CL·March 27, 2026

TL;DR

This paper investigates the relationship between data compression and code intelligence in large language models, revealing a fundamental logarithmic correlation rather than a linear one, and introduces a new evaluation methodology.

Contribution

It introduces Format Annealing for fair evaluation of pre-trained code LLMs and demonstrates a logarithmic relationship between compression and code intelligence.

Findings

01. Logarithmic relationship between bits-per-character and code intelligence

02. Introduction of Format Annealing for model evaluation

03. New large-scale code validation set from GitHub
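The first finding contrasts a logarithmic fit with a linear one. The sketch below shows one way such a comparison could be run: fit benchmark scores against BPC and against ln(BPC) by ordinary least squares and compare the R² values. The functional form score = a + b·ln(BPC), the helper names `least_squares` and `compare_fits`, and the data points are all assumptions for illustration; the numbers are synthetic and are not results from the paper.

```python
import math

def least_squares(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

def compare_fits(bpc, scores):
    """Return (r2_linear, r2_log) for score vs. BPC and score vs. ln(BPC)."""
    _, _, r2_linear = least_squares(bpc, scores)
    _, _, r2_log = least_squares([math.log(b) for b in bpc], scores)
    return r2_linear, r2_log

# Synthetic data (hypothetical): scores generated exactly from a log curve.
bpc = [0.3, 0.4, 0.5, 0.6, 0.8, 1.0]
scores = [-0.4 * math.log(b) for b in bpc]

r2_linear, r2_log = compare_fits(bpc, scores)
print(f"linear R^2 = {r2_linear:.4f}, log R^2 = {r2_log:.4f}")
```

On real measurements the two R² values would be closer together, so a residual plot or held-out error is usually a more convincing check than R² alone.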

Abstract

Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs' code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC),…
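The abstract measures compression efficacy as bits-per-character (BPC). A minimal sketch of how BPC is commonly computed from a language model's per-token losses follows: sum the negative log-likelihoods, convert from nats to bits, and divide by the character count of the text. The function name `bits_per_character`, the assumption that losses are in nats, and the choice to count characters with `len()` (Unicode code points) are illustrative and may differ from the paper's exact procedure.

```python
import math

def bits_per_character(token_nll_nats, text):
    """BPC = (total negative log-likelihood in bits) / (number of characters).

    token_nll_nats: per-token negative log-likelihoods in nats (natural log),
                    as produced by a typical cross-entropy loss.
    text:           the string those tokens encode.
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_bits = sum(token_nll_nats) / math.log(2)  # convert nats to bits
    return total_bits / len(text)

# Hypothetical example: 8 tokens, each costing exactly 1 bit (ln 2 nats),
# covering a 16-character snippet.
snippet = "def f(x): pass\n\n"
losses = [math.log(2)] * 8
bpc = bits_per_character(losses, snippet)
print(f"BPC = {bpc:.3f}")
```

Normalizing by characters rather than tokens is what makes the number comparable across models with different tokenizers: a lower BPC means the model compresses the same text into fewer bits.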

Taxonomy

Topics: Computability, Logic, AI Algorithms · Software Engineering Research

Methods: Sparse Evolutionary Training