TL;DR
Proxy compression allows language models to be trained on compressed data while maintaining raw-byte inference, improving efficiency and performance over traditional tokenization methods.
Contribution
Introduces a novel training scheme that aligns compressed sequences with raw bytes, enabling efficient, tokenizer-free language modeling.
Findings
Significantly improves training efficiency on code language modeling tasks.
Outperforms pure byte-level baselines within fixed compute budgets.
Matches or surpasses tokenizer-based models as scale increases.
Abstract
Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor that typically operates over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and on compressed views generated by external compressors; through this process, the model learns to align compressed sequences with raw bytes internally. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially…
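The joint training setup described in the abstract can be illustrated with a minimal data-preparation sketch. Everything specific here is an assumption for illustration, not the paper's method: `zlib` stands in for the "external compressors" only because it ships with Python, the tag ids `RAW_TAG` and `COMPRESSED_TAG` are hypothetical format markers, and `p_compressed` is a hypothetical mixing probability. The sketch only shows the data flow: each document is presented to the model either as raw UTF-8 bytes or as a shorter compressed view, and both views decode back to the same text.

```python
import random
import zlib
from typing import List

# Hypothetical marker ids placed after the 256 byte values (an assumption,
# not taken from the paper): they tell the model which view follows.
RAW_TAG = 256
COMPRESSED_TAG = 257


def raw_view(text: str) -> List[int]:
    """Raw-byte view: one id per UTF-8 byte, prefixed with the raw marker."""
    return [RAW_TAG] + list(text.encode("utf-8"))


def compressed_view(text: str) -> List[int]:
    """Compressed view: zlib output bytes (a stand-in compressor), prefixed
    with the compressed marker."""
    return [COMPRESSED_TAG] + list(zlib.compress(text.encode("utf-8")))


def sample_training_sequence(text: str, p_compressed: float,
                             rng: random.Random) -> List[int]:
    """Pick the compressed view with probability p_compressed, else raw bytes."""
    if rng.random() < p_compressed:
        return compressed_view(text)
    return raw_view(text)


def decode_view(seq: List[int]) -> str:
    """Recover the original text from either view (both are lossless)."""
    tag, payload = seq[0], bytes(seq[1:])
    if tag == COMPRESSED_TAG:
        payload = zlib.decompress(payload)
    return payload.decode("utf-8")


if __name__ == "__main__":
    code = "def f(x):\n    return x + 1\n" * 20
    rng = random.Random(0)
    seq = sample_training_sequence(code, p_compressed=0.9, rng=rng)
    print(len(raw_view(code)), len(compressed_view(code)), decode_view(seq) == code)
```

With `p_compressed` close to 1, most training sequences are the shorter compressed views, which is where the efficiency gain would come from in a setup like this; at inference only `raw_view`-style input is used, so the compressor can be discarded.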
