ARC-Encoder: learning compressed text representations for large language models
Hippolyte Pilchen, Edouard Grave, Patrick Pérez

TL;DR
ARC-Encoder is a versatile, efficient encoder that compresses context for large language models, reducing inference costs while maintaining state-of-the-art performance across various tasks and models.
Contribution
It introduces ARC-Encoder, a novel adaptable compression method that generalizes across multiple decoders without fine-tuning the entire model.
Findings
Achieves state-of-the-art results on several benchmarks.
Reduces computational costs during inference.
Works effectively across different LLMs and tasks.
Abstract
Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x \in \{4, 8\}$) than text tokens. We evaluate ARC-Encoder across a variety of…
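The data flow described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: it assumes the compression is a simple mean over runs of `x` consecutive encoder outputs followed by a two-layer projector, and every name, dimension, and random weight below (`mean_pool`, `project`, `d_enc`, `d_mid`, `d_dec`) is hypothetical. The reviews below indicate that the actual model pools inside the encoder's attention layers instead.

```python
import random

def mean_pool(vectors, x):
    """Average each run of x consecutive vectors; the last run may be shorter."""
    pooled = []
    for start in range(0, len(vectors), x):
        group = vectors[start:start + x]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled

def linear(vectors, weight, bias):
    """Apply y = v @ weight + bias to each row vector v (weight is d_in x d_out)."""
    return [
        [sum(v_i * w_i for v_i, w_i in zip(v, col)) + b
         for col, b in zip(zip(*weight), bias)]
        for v in vectors
    ]

def project(vectors, w1, b1, w2, b2):
    """Hypothetical projector: linear -> ReLU -> linear into the decoder's embedding size."""
    hidden = [[max(h, 0.0) for h in row] for row in linear(vectors, w1, b1)]
    return linear(hidden, w2, b2)

def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights or hidden states."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, x, d_enc, d_mid, d_dec = 10, 4, 8, 16, 12  # assumed toy sizes

encoder_states = random_matrix(n, d_enc, rng)  # one vector per context token
w1, b1 = random_matrix(d_enc, d_mid, rng), [0.0] * d_mid
w2, b2 = random_matrix(d_mid, d_dec, rng), [0.0] * d_dec

# ceil(n / x) continuous representations in the decoder's embedding space
compressed = project(mean_pool(encoder_states, x), w1, b1, w2, b2)

# The compressed vectors take the place of the context's token embeddings;
# the prompt keeps the decoder's ordinary token embeddings (random stand-ins here).
prompt_embeddings = random_matrix(5, d_dec, rng)
decoder_input = compressed + prompt_embeddings
print(len(compressed), len(decoder_input), len(decoder_input[0]))
```

With these toy sizes, a 10-token context becomes 3 vectors, so the decoder processes 8 positions instead of 15.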
Peer Reviews
Decision · Submitted to ICLR 2026
1. This paper introduces a new formulation of context compression that does not alter the decoder. Unlike prior “memory token” or “gist token” methods, ARC-Encoder performs fixed-ratio pooling within the encoder’s attention layers and connects to decoders through a lightweight MLP. This architectural separation is elegant and conceptually clean.
2. The authors conduct a broad and fair evaluation across multiple domains. Results show consistent improvements over strong baselines, often matching o
1. This paper does not provide a deeper theoretical discussion of why pooled query averaging in attention preserves semantic fidelity or why it outperforms token-level compression. A brief analytical or representational argument could strengthen the paper’s foundation.
2. How sensitive is performance to the dimensionality of the MLP bottleneck?
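For readers unfamiliar with the "pooled query averaging" that the reviewers refer to, the following is a minimal sketch of one way such a layer could work. It assumes a single attention head, no attention mask, and mean pooling over runs of `x` consecutive queries; the paper's actual layer may differ on all three points, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooled_query_attention(queries, keys, values, x):
    """Single-head attention whose queries are averaged in runs of x.

    Keys and values keep all n positions, so each of the ceil(n / x)
    outputs can still attend to every token of the context.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for start in range(0, len(queries), x):
        group = queries[start:start + x]
        q = [sum(col) / len(group) for col in zip(*group)]  # pooled query
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v for w, v in zip(weights, col)) for col in zip(*values)])
    return outputs

# Toy example: 4 positions pooled with x = 2 give 2 output vectors.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
keys = [[0.0, 0.0]] * 4  # identical keys, so attention weights are uniform
values = [[1.0], [2.0], [3.0], [4.0]]
pooled_out = pooled_query_attention(queries, keys, values, x=2)
print(len(pooled_out), pooled_out)
```

The point of the sketch is that the sequence length shrinks at the query side only: the output has one row per group of queries, while the attention weights are still computed over all uncompressed keys.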
ARC-Encoder does not require decoder modification, which allows it to be adapted to existing LLMs. Multi-decoder adaptation needs only a small number of additional parameters, which keeps deployment costs low. The evaluation covers both short- and long-context tasks, and the memory analysis supports precomputation, indicating strong potential for practical application.
Novelty is limited: the framework is highly similar to ICAE, and there are no breakthrough designs in multi-decoder adaptation or long-context strategies. Furthermore, the paper does not explore performance at high compression factors (16×/32×) or generalization to professional domains, and it does not compare inference latency in real-world scenarios.
1. The proposed ARC-Encoder offers a solution for context compression, achieving this without altering the underlying LLM architecture.
2. The method demonstrates strong empirical results over different tasks.
1. The claim that ARC-Encoder “works seamlessly with multiple LLMs” is overstated, since in practice it still requires fine-tuning separate projectors for each target model, even if the number of parameters is small.
2. The encoder model is very large (e.g., ~3B parameters), which raises serious concerns about practical efficiency. The paper should provide detailed FLOPs and latency analyses to substantiate efficiency claims.
3. It is unclear how the text encoder’s embeddings are initialized,
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Multimodal Machine Learning Applications
