Proxy Compression for Language Modeling

Lin Zheng; Xinyu Li; Qian Liu; Xiachong Feng; Lingpeng Kong

arXiv:2602.04289·cs.CL·May 15, 2026

TL;DR

Proxy compression trains a language model mostly on compressed data while keeping a raw-byte interface at inference. It outperforms pure byte-level baselines under fixed compute budgets and matches or surpasses tokenizer-based models as scale increases.

Contribution

Introduces a training scheme in which a single model is trained jointly on compressed sequences and raw bytes and learns to align the two, enabling efficient language modeling that needs no tokenizer at inference.

Findings

1. Significantly improves training efficiency on code language modeling tasks.

2. Outperforms pure byte-level baselines within fixed compute budgets.

3. Matches or surpasses tokenizer-based models as scale increases.

Abstract

Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially…
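
The joint training setup described in the abstract can be illustrated with a small data-preparation sketch. This is not the authors' implementation: `zlib` stands in for the external compressor, and the sentinel ids, the id offset for compressed symbols, and the `p_compressed` mixing ratio are all hypothetical choices made for illustration. The idea shown is only that each document reaches the model either as raw UTF-8 bytes or as a compressed view, while decoding at inference touches raw bytes alone.

```python
import random
import zlib

# All ids below are hypothetical; the paper's actual vocabulary layout may differ.
RAW_TAG = 512            # sentinel marking a raw-byte sequence
COMPRESSED_TAG = 513     # sentinel marking a compressed view
COMPRESSED_OFFSET = 256  # compressed symbols use ids 256..511, disjoint from raw bytes 0..255


def raw_view(text: str) -> list[int]:
    """Raw UTF-8 bytes (ids 0..255), prefixed with the raw tag."""
    return [RAW_TAG] + list(text.encode("utf-8"))


def compressed_view(text: str) -> list[int]:
    """zlib output (a stand-in compressor), shifted into its own id range."""
    payload = zlib.compress(text.encode("utf-8"))
    return [COMPRESSED_TAG] + [b + COMPRESSED_OFFSET for b in payload]


def sample_view(text: str, p_compressed: float, rng: random.Random) -> list[int]:
    """Return the compressed view with probability p_compressed, else raw bytes."""
    if rng.random() < p_compressed:
        return compressed_view(text)
    return raw_view(text)


def decode_raw(ids: list[int]) -> str:
    """Inference-time interface: only raw-byte ids are consumed or produced."""
    assert ids[0] == RAW_TAG, "inference uses the raw-byte format only"
    return bytes(ids[1:]).decode("utf-8")


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "def f(x):\n    return x\n" * 50
    batch = [sample_view(doc, p_compressed=0.9, rng=rng) for _ in range(4)]
    for seq in batch:
        kind = "compressed" if seq[0] == COMPRESSED_TAG else "raw"
        print(kind, len(seq))
```

In this sketch a high `p_compressed` means most training sequences are the shorter compressed views, which is where the efficiency gain would come from; the compressed branch is simply never called at inference.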

Code & Models

Repositories

LZhengisme/proxy-compression (GitHub)
