Long Context Compression with Activation Beacon
Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

TL;DR
Activation Beacon is a plug-in module for transformer-based LLMs that compresses long contexts by directly compressing activations (the keys and values at every layer), enabling efficient processing with minimal performance loss across various long-context tasks.
Contribution
It proposes a new activation compression method that directly reduces key and value activations, supporting flexible compression ratios and maintaining high performance on long-context tasks.
Findings
Achieves 2x faster inference and 8x lower KV cache memory cost than the uncompressed baseline.
Maintains comparable performance to uncompressed models on long-context tasks.
Effectively handles contexts far exceeding training length limits.
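To make the 8x memory figure concrete, the following is a back-of-the-envelope sketch of KV cache size before and after compression. The model shape (32 layers, 32 key/value heads, head dimension 128, 16-bit values) and the 32,768-token context are assumptions chosen to resemble a 7B-class model; they are not taken from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence, in bytes.

    The leading factor 2 accounts for storing one key vector and one
    value vector per token, per head, per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed 7B-class shape (illustrative only, not from the paper).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

context_len = 32768  # assumed context length
ratio = 8            # compression ratio reported in the findings

full = kv_cache_bytes(context_len, **cfg)
# With ratio 8, roughly one cache entry survives per 8 input tokens.
compressed = kv_cache_bytes(context_len // ratio, **cfg)

print(f"uncompressed: {full / 2**30:.1f} GiB")        # uncompressed: 16.0 GiB
print(f"compressed:   {compressed / 2**30:.1f} GiB")  # compressed:   2.0 GiB
```

Under these assumptions the cache shrinks in direct proportion to the compression ratio, since the number of layers, heads, and dimensions is unchanged and only the number of cached positions drops.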
Abstract
Long context compression is a critical research problem due to its significance in reducing the high computational and memory costs associated with LLMs. In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts. To achieve this, our method introduces the following technical designs. 1) We directly compress the activations (i.e. keys and values at every layer), rather than leveraging soft prompts to relay information (which constitute a major bottleneck to encapsulate the complex information within long contexts). 2) We tailor the compression workflow, where each fine-grained input unit is progressively compressed, enabling high-quality compression and efficient computation during both training and inference. 3) We train the model through compression-based auto-regression, making…
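The chunk-wise workflow can be illustrated with a toy bookkeeping sketch. It does not run attention or touch real activations; it only shows how a long input might be split into chunks, how beacon placeholders could be interleaved at a chosen compression ratio, and how many cache entries per layer would remain if only the beacons' keys and values were kept. The names `BEACON`, `interleave_beacons`, and `compress_context`, the chunk size, and the choice to give a trailing partial unit its own beacon are all assumptions made for illustration.

```python
from typing import List, Tuple

BEACON = "<bcn>"  # hypothetical placeholder for a beacon token


def interleave_beacons(chunk: List[str], ratio: int) -> List[str]:
    """Insert one beacon after every `ratio` tokens of a chunk."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    out: List[str] = []
    for i, tok in enumerate(chunk, start=1):
        out.append(tok)
        if i % ratio == 0:
            out.append(BEACON)
    # Assumption: a trailing unit shorter than `ratio` still gets a beacon.
    if len(chunk) % ratio != 0:
        out.append(BEACON)
    return out


def compress_context(tokens: List[str], chunk_size: int,
                     ratio: int) -> Tuple[int, int]:
    """Return (raw_entries, retained_entries) per layer for the whole input."""
    retained = 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        mixed = interleave_beacons(chunk, ratio)
        # Only the beacon positions are counted as retained cache entries;
        # the raw tokens of the chunk are treated as discarded afterwards.
        retained += sum(1 for t in mixed if t == BEACON)
    return len(tokens), retained


tokens = [f"t{i}" for i in range(4096)]
raw, kept = compress_context(tokens, chunk_size=1024, ratio=8)
print(raw, kept)  # 4096 512
```

Changing `ratio` per chunk is one way to picture the flexible compression ratios the abstract mentions, although how ratios are actually selected is not described in this summary.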
Peer Reviews
Decision: ICLR 2025 Poster
- Compressing by chunks at each layer avoids the need for recomputation and addresses gradient back-propagation challenges present in some prior baselines that rely on recursive dependencies from final-layer outputs. This design enhances both training and inference efficiency.
- The chunking approach and the interleaved insertion of beacon tokens are straightforward and intuitive.
- Evaluations on various benchmarks indicate that the proposed approach generally outperforms the KV cache compressi
- In addition to LongBench and NIAH, it is essential to evaluate the proposed approach on newer, more challenging benchmarks, such as RULER [1].
- Some recent context compression baselines, including CEPE [2] and LLoCO [3], are not discussed in the paper and should be included for a more comprehensive discussion or comparison.
[1] Hsieh et al. RULER: What's the Real Context Size of Your Long-Context Language Models? COLM 2024.
[2] Yen et al. Long-Context Language Modeling with Parallel Contex
1. Activation Beacon reduces inference time by 2x and KV cache memory costs by 8x compared to the uncompressed baseline.
2. The method supports adaptive compression ratios, allowing flexibility for different tasks and contexts.
3. The proposed model maintains short-context capabilities, preserving the performance of the original LLM.
1. The performance of this method may vary with model size. Current evaluations focus on medium-sized models, lacking validation on larger-scale models, leaving its effectiveness and applicability in very large models underexplored.
2. The added complexity of managing beacon tokens and compression ratios increases implementation overhead for end-users, particularly when adapting to different tasks. In addition to actual inference latency, specific memory usage data across implementations would
- The paper presents an efficient method to compress long contexts, reducing memory usage by up to 8x and speeding up inference by 2x.
- Its progressive, fine-grained compression approach maintains high compression quality, allowing the model to handle longer inputs than its built-in context window.
- It supports flexible compression ratios, preserving model performance across various long-context tasks without degrading short-context capabilities.
- Lack of Comparison with KIVI: The paper does not provide a direct comparison with KIVI, a relevant compression method that could offer insights into the performance trade-offs.
- GPU Time Omission: The paper does not report GPU training or inference time, leaving uncertainty around the practical computational cost and efficiency of the proposed method.
- Scalability Concerns: The method requires 8 A800 GPUs to train a 7B parameter model, raising concerns about its scalability to larger models
Code & Models
- 🤗 BAAI/bge-small-en-v1.5 · model · 12.0M dl · ♡ 430
- 🤗 BAAI/bge-large-en-v1.5 · model · 6.6M dl · ♡ 638
- 🤗 BAAI/bge-base-en-v1.5 · model · 5.5M dl · ♡ 408
- 🤗 BAAI/bge-reranker-base · model · 2.3M dl · ♡ 227
- 🤗 BAAI/bge-large-zh-v1.5 · model · 610k dl · ♡ 616
- 🤗 BAAI/bge-reranker-large · model · 790k dl · ♡ 454
- 🤗 lee0321/bge-large-zh-v1.5 · model · 1 dl
- 🤗 avsolatorio/01-100-11-1-2-2-0-0-cls-normed-384-512_GIST_BAAI_bge-small-en-v1.5-20240202154404-best · model · 291 dl
- 🤗 avsolatorio/01-100-11-1-2-2-0-0-cls-normed-384-512_GIST_BAAI_bge-small-en-v1.5-20240202154404-latest · model · 298 dl
- 🤗 avsolatorio/01-100-11-1-2-2-0-0-cls-normed-384-512_GIST_BAAI_bge-small-en-v1.5-20240202160129-best · model · 297 dl
Taxonomy
Topics: Advanced Data Storage Technologies
