ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

Jing Xiong; Jianghan Shen; Chuanyang Zheng; Zhongwei Wan; Chenyang Zhao; Chiwun Yang; Fanghua Ye; Hongxia Yang; Lingpeng Kong; Ngai Wong

arXiv:2502.14317 · cs.CL · June 10, 2025

Open Access

TL;DR

ParallelComp is a parallel long-context compression method that overcomes the memory bottleneck of training-free extrapolation, enabling large language models to extrapolate to contexts of up to 128K tokens without retraining.

Contribution

The paper presents a parallel attention mechanism with dynamic chunk eviction, together with a systematic analysis of the attention biases that arise in parallel attention and ways to mitigate them for ultra-long context extrapolation.

Findings

01. Achieves 91.17% of GPT-4 performance on ultra-long contexts.

02. Provides a 1.76x improvement in chunk throughput.

03. Realizes a 23.50x acceleration in the prefill stage.

Abstract

Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs), as most training-free extrapolation methods are not only severely limited by memory bottlenecks but also suffer from the attention sink, which restricts their scalability and effectiveness in practice. In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck, enabling 8B-parameter LLMs to extrapolate from 8K to 128K tokens on a single A100 80GB GPU in a training-free setting. ParallelComp splits the input into chunks and dynamically evicts redundant chunks and irrelevant tokens, supported by a parallel KV cache eviction mechanism. Importantly, we present a systematic theoretical and empirical analysis of attention biases in parallel attention, including the attention sink, recency bias, and middle…
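The chunk-then-evict idea described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the fixed chunk size, and the token-overlap relevance score are all assumptions made for illustration, and the real method operates on attention and KV-cache state rather than raw token strings.

```python
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def overlap_score(chunk: Sequence[str], query: Sequence[str]) -> float:
    """Hypothetical relevance score: fraction of chunk tokens that occur in the query.

    A stand-in for whatever relevance criterion a real system would use.
    """
    query_set = set(query)
    return sum(1 for tok in chunk if tok in query_set) / max(len(chunk), 1)


def evict_chunks(
    chunks: List[List[str]],
    query: Sequence[str],
    keep: int,
    score_fn: Callable[[Sequence[str], Sequence[str]], float] = overlap_score,
) -> List[List[str]]:
    """Keep the `keep` highest-scoring chunks, preserving their original order."""
    scored = [(score_fn(chunk, query), idx) for idx, chunk in enumerate(chunks)]
    # Sort by descending score; break ties in favour of earlier chunks.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    kept_indices = sorted(idx for _, idx in top)
    return [chunks[i] for i in kept_indices]


context = "the cat sat on the mat while the dog slept near the door".split()
query = "where did the dog sleep".split()

chunks = split_into_chunks(context, chunk_size=4)
kept = evict_chunks(chunks, query, keep=2)
print(kept)
```

Because each chunk is scored independently of the others, the scoring step could in principle be run in parallel across chunks, which is the property the abstract relies on to avoid the memory bottleneck.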

Taxonomy

Topics: Heat Transfer and Optimization · Embedded Systems and FPGA Design · Astronomical Observations and Instrumentation

Methods: Softmax · Attention Is All You Need