Large Language Model as Token Compressor and Decompressor
Wenbing Li, Yiran Wang, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin, Wei Yang

TL;DR
This paper introduces a method to adapt large language models into token compressors and decompressors, enabling efficient long-context processing by encoding texts into compact latent codes with minimal performance loss.
Contribution
It presents a self-expressive autoencoding framework that fine-tunes a pretrained LLM with LoRA adapters to produce content-adaptive, variable-length compressed representations of long texts.
Findings
Preserves reconstruction quality on long-context datasets
Reduces memory usage and latency during generation
Supports direct decoding and autoregressive generation in compressed space
Abstract
In this paper, we study whether an off-the-shelf LLM can be adapted into a discrete, variable-length token compressor and decompressor for long-context processing. To this end, we design a self-expressive autoencoding framework that fine-tunes a pretrained LLM with lightweight LoRA adapters to map long texts into compact sequences of learned latent codes, termed Z-tokens, and to decode them back into natural language or task outputs. The resulting representation is content-adaptive: less predictable or information-dense segments can receive more Z-tokens, while redundant regions can be represented more compactly through a budget-aware length regularizer. Our method is evaluated on long-context datasets such as Wikipedia, CNN/DailyMail, HotpotQA, and QuALITY, showing that it preserves reconstruction quality and downstream performance while reducing effective context length,…
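The content-adaptive allocation described in the abstract can be illustrated with a toy sketch. This is not the authors' method: the abstract gives no formula for the regularizer, so the proportional-to-surprisal split, the hinge-style penalty, and every name here (allocate_z_tokens, length_penalty, min_per_segment, weight) are assumptions made purely for illustration.

```python
from typing import List


def allocate_z_tokens(surprisals: List[float], total_budget: int,
                      min_per_segment: int = 1) -> List[int]:
    """Hypothetical helper: split a Z-token budget across text segments.

    Segments with higher surprisal (less predictable content) receive more
    Z-tokens; redundant segments fall back to the per-segment minimum.
    """
    n = len(surprisals)
    if n == 0:
        return []
    if any(s < 0 for s in surprisals):
        raise ValueError("surprisal scores must be non-negative")
    if total_budget < n * min_per_segment:
        raise ValueError("budget too small for the per-segment minimum")

    # Tokens left to distribute once every segment has its minimum.
    spare = total_budget - n * min_per_segment
    total = sum(surprisals)
    if total == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * s / total for s in surprisals]

    # Largest-remainder rounding so the counts sum exactly to the budget.
    counts = [min_per_segment + int(share) for share in shares]
    leftover = max(0, total_budget - sum(counts))
    by_remainder = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def length_penalty(num_z_tokens: int, budget: int, weight: float = 0.1) -> float:
    """Assumed hinge-style regularizer: only lengths above the budget are penalized."""
    return weight * max(0, num_z_tokens - budget)


if __name__ == "__main__":
    # Three segments: moderately surprising, highly surprising, fully redundant.
    print(allocate_z_tokens([1.0, 3.0, 0.0], total_budget=7))
    print(length_penalty(num_z_tokens=10, budget=8, weight=0.5))
```

In a trained system the surprisal scores would presumably come from the model's own token-level predictions, and the penalty would be added to the reconstruction loss; both of those choices are likewise assumptions rather than details stated in the abstract.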
