Is Compression Really Linear with Code Intelligence?
Shijie Xuyang, Xianzhen Luo, Zheng Chu, Houyi Li, Siming Huang, Qiufeng Wang, Wanxiang Che, Qingfu Zhu, Shuigeng Zhou

TL;DR
This paper investigates the relationship between data compression and code intelligence in large language models, revealing a fundamental logarithmic correlation rather than a linear one, and introduces a new evaluation methodology.
Contribution
It introduces Format Annealing for fair evaluation of pre-trained code LLMs and demonstrates a logarithmic relationship between compression and code intelligence.
Findings
Logarithmic relationship between bits-per-character and code intelligence
Introduction of Format Annealing for model evaluation
New large-scale code validation set from GitHub
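As a rough illustration of the first finding, the sketch below computes bits-per-character from per-token log-probabilities and fits a curve of the form score = a + b * ln(BPC). This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the log-probabilities are assumed to be natural logs produced by some language model, and the functional form is only one plausible reading of "logarithmic", since this summary does not give the paper's exact equation.

```python
import math


def bits_per_character(token_logprobs, text):
    """Return the BPC of `text` given natural-log probabilities of its tokens.

    Assumption: `token_logprobs` holds one ln p(token | prefix) per token,
    and the tokens together cover exactly the characters of `text`.
    """
    if not text:
        raise ValueError("text must be non-empty")
    total_nats = -sum(token_logprobs)       # negative log-likelihood in nats
    total_bits = total_nats / math.log(2)   # convert nats to bits
    return total_bits / len(text)           # normalise by character count


def fit_log_curve(bpcs, scores):
    """Ordinary least-squares fit of score = a + b * ln(bpc).

    Hypothetical helper: the exact form used in the paper is not stated
    in this summary, so this is an illustrative choice only.
    """
    xs = [math.log(b) for b in bpcs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

Normalising by characters rather than tokens keeps the measure comparable across models with different tokenizers, which matters when many Code LLMs are placed on the same axis.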
Abstract
Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs' code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC),…
Taxonomy
Topics: Computability, Logic, AI Algorithms · Software Engineering Research
Methods: Sparse Evolutionary Training
