Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation

Yeqin Zhang; Yizheng Zhao; Chen Hu; Binxing Jiao; Daxin Jiang; Ruihang Miao; Cam-Tu Nguyen

arXiv:2511.17129 · cs.CL · December 25, 2025

TL;DR

This paper introduces context compression as a pretext task for the unsupervised adaptation of large language models, improving their ability to produce compact, holistic text representations for downstream tasks such as clustering and retrieval.

Contribution

The paper proposes a compression-based pretraining method that improves LLMs' text representations, outperforms token-level pretext tasks, and is more sample-efficient than existing approaches.

Findings

01 Compression pretraining outperforms token-level pretext tasks

02 LLM2Comp achieves state-of-the-art results on multiple tasks

03 Method is more sample-efficient than existing approaches

Abstract

Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies have introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute the whole context for downstream…
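To make the idea of memory tokens concrete, here is a minimal sketch of how a fixed number of memory-token positions might be appended to an input and their final hidden states turned into a single text embedding. This is an illustration only, not the paper's implementation: the function names, the placeholder token id, the random stand-in for model outputs, and the choice of mean pooling with L2 normalization are all assumptions that the abstract does not confirm.

```python
import numpy as np


def append_memory_tokens(token_ids, num_memory, memory_token_id):
    """Return the input ids followed by `num_memory` copies of a placeholder id.

    `memory_token_id` is a hypothetical id; a real tokenizer would need
    dedicated special tokens added to its vocabulary.
    """
    return list(token_ids) + [memory_token_id] * num_memory


def pool_memory_states(hidden_states, num_memory):
    """Mean-pool the last `num_memory` rows and L2-normalize the result.

    Mean pooling is an assumption made for this sketch, not something
    stated in the abstract.
    """
    memory_states = hidden_states[-num_memory:]
    embedding = memory_states.mean(axis=0)
    return embedding / np.linalg.norm(embedding)


# Toy stand-in for a model's final-layer output: one row per position.
rng = np.random.default_rng(0)
ids = append_memory_tokens([11, 12, 13, 14], num_memory=2, memory_token_id=-1)
hidden = rng.normal(size=(len(ids), 8))
embedding = pool_memory_states(hidden, num_memory=2)
```

In a causal model, positions at the end of the sequence can attend to every earlier token, which is one plausible reason to place the memory tokens after the context rather than before it.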

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare