Long Context Compression with Activation Beacon

Peitian Zhang; Zheng Liu; Shitao Xiao; Ninglu Shao; Qiwei Ye; Zhicheng Dou

arXiv:2401.03462·cs.CL·October 14, 2024·2 cites


TL;DR

Activation Beacon is a plug-in module for transformer-based LLMs that compresses long contexts by directly compressing the key and value activations at every layer, enabling efficient processing with minimal performance loss across various long-context tasks.

Contribution

It proposes a new activation compression method that directly reduces key and value activations, supporting flexible compression ratios and maintaining high performance on long-context tasks.

Findings

01 Achieves 2x faster inference and 8x lower KV cache memory cost than the uncompressed baseline.

02 Maintains performance comparable to uncompressed models on long-context tasks.

03 Effectively handles contexts far exceeding the training length limit.
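To put the memory figure in perspective, here is a rough sketch of how KV cache size scales with context length and what an 8x reduction would mean. The model dimensions below (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are illustrative assumptions for a 7B-class model and are not taken from the paper; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Size of the KV cache: one key and one value vector per token, head, and layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed 7B-class configuration (not from the paper).
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 32, 32, 128, 2

full = kv_cache_bytes(32768, LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)
# With an 8x compression ratio, only 1/8 of the positions keep cached activations.
compressed = kv_cache_bytes(32768 // 8, LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)

print(full / 2**30, "GiB uncompressed")
print(compressed / 2**30, "GiB at 8x compression")
```

Because the cache grows linearly in the number of cached positions, the saving tracks the compression ratio directly under these assumptions.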

Abstract

Long context compression is a critical research problem due to its significance in reducing the high computational and memory costs associated with LLMs. In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts. To achieve this, our method introduces the following technical designs. 1) We directly compress the activations (i.e. keys and values at every layer), rather than leveraging soft prompts to relay information (which constitute a major bottleneck to encapsulate the complex information within long contexts). 2) We tailor the compression workflow, where each fine-grained input unit is progressively compressed, enabling high-quality compression and efficient computation during both training and inference. 3) We train the model through compression-based auto-regression, making…
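The chunk-wise workflow described in the abstract can be illustrated with a small bookkeeping sketch. It only tracks which positions exist and how many cached activations survive each step; it does not run a model. The function names, the chunk length, and the rule of placing one beacon after every `ratio` raw tokens are assumptions made for illustration, not the authors' implementation.

```python
import math


def beacon_layout(chunk_len: int, ratio: int) -> list[str]:
    """Interleave one beacon marker after every `ratio` raw tokens in a chunk."""
    if chunk_len % ratio != 0:
        raise ValueError("chunk_len must be divisible by ratio")
    layout = []
    for i in range(chunk_len):
        layout.append("tok")
        if (i + 1) % ratio == 0:
            layout.append("beacon")
    return layout


def retained_cache_sizes(num_tokens: int, chunk_len: int, ratio: int) -> list[int]:
    """Cached positions kept after each chunk, if only beacon activations are retained."""
    cache = 0
    history = []
    for start in range(0, num_tokens, chunk_len):
        size = min(chunk_len, num_tokens - start)
        # Raw-token keys/values are discarded; one beacon per `ratio` tokens remains.
        cache += math.ceil(size / ratio)
        history.append(cache)
    return history


print(beacon_layout(4, 2))
print(retained_cache_sizes(8192, 1024, 8))
```

In this sketch each new chunk only needs the beacons accumulated so far plus its own raw tokens, which is why the retained cache grows by `chunk_len / ratio` per chunk rather than by `chunk_len`.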

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 5

Strengths

- Compressing by chunks at each layer avoids the need for recomputation and addresses gradient back-propagation challenges present in some prior baselines that rely on recursive dependencies from final-layer outputs. This design enhances both training and inference efficiency.
- The chunking approach and the interleaved insertion of beacon tokens are straightforward and intuitive.
- Evaluations on various benchmarks indicate that the proposed approach generally outperforms the KV cache compressi

Weaknesses

- In addition to LongBench and NIAH, it is essential to evaluate the proposed approach on newer, more challenging benchmarks, such as RULER [1].
- Some recent context compression baselines, including CEPE [2] and LLoCO [3], are not discussed in the paper and should be included for a more comprehensive discussion or comparison.

[1] Hsieh et al. RULER: What's the Real Context Size of Your Long-Context Language Models? COLM 2024.
[2] Yen et al. Long-Context Language Modeling with Parallel Contex

Reviewer 02·Rating 6·Confidence 4

Strengths

1. Activation Beacon reduces inference time by 2x and KV cache memory costs by 8x compared to the uncompressed baseline.
2. The method supports adaptive compression ratios, allowing flexibility for different tasks and contexts.
3. The proposed model maintains short-context capabilities, preserving the performance of the original LLM.

Weaknesses

1. The performance of this method may vary with model size. Current evaluations focus on medium-sized models, lacking validation on larger-scale models, leaving its effectiveness and applicability in very large models underexplored.
2. The added complexity of managing beacon tokens and compression ratios increases implementation overhead for end-users, particularly when adapting to different tasks. In addition to actual inference latency, specific memory usage data across implementations would

Reviewer 03·Rating 8·Confidence 3

Strengths

- The paper presents an efficient method to compress long contexts, reducing memory usage by up to 8x and speeding up inference by 2x.
- Its progressive, fine-grained compression approach maintains high compression quality, allowing the model to handle longer inputs than its built-in context window.
- It supports flexible compression ratios, preserving model performance across various long-context tasks without degrading short-context capabilities.

Weaknesses

- Lack of Comparison with KIVI: The paper does not provide a direct comparison with KIVI, a relevant compression method that could offer insights into the performance trade-offs.
- GPU Time Omission: The paper does not report GPU training or inference time, leaving uncertainty around the practical computational cost and efficiency of the proposed method.
- Scalability Concerns: The method requires 8 A800 GPUs to train a 7B parameter model, raising concerns about its scalability to larger models

Code & Models

Repositories

flagopen/flagembedding
pytorch·Official


Taxonomy

Topics·Advanced Data Storage Technologies