ParallelComp: Parallel Long-Context Compressor for Length Extrapolation
Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong

TL;DR
ParallelComp introduces a parallel long-context compression method that overcomes memory limitations, enabling large language models to extrapolate efficiently to ultra-long contexts of up to 128K tokens without retraining.
Contribution
It presents a novel parallel attention mechanism with dynamic chunk eviction, systematically analyzing attention biases and mitigating them for ultra-long context extrapolation.
Findings
Achieves 91.17% of GPT-4 performance on ultra-long contexts.
Provides 1.76x chunk throughput improvement.
Realizes 23.50x acceleration in prefill stage.
Abstract
Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs), as most training-free extrapolation methods are not only severely limited by memory bottlenecks, but also suffer from the attention sink, which restricts their scalability and effectiveness in practice. In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck, enabling 8B-parameter LLMs to extrapolate from 8K to 128K tokens on a single A100 80GB GPU in a training-free setting. ParallelComp splits the input into chunks, dynamically evicting redundant chunks and irrelevant tokens, supported by a parallel KV cache eviction mechanism. Importantly, we present a systematic theoretical and empirical analysis of attention biases in parallel attention, including the attention sink, recency bias, and middle…
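The chunk-then-evict idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`split_into_chunks`, `evict_chunks`, `overlap_score`) and the token-overlap scoring rule are hypothetical placeholders, since the excerpt does not state how chunks are scored. Token-level eviction and the parallel KV cache eviction mechanism are not modeled here.

```python
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def evict_chunks(
    chunks: List[List[int]],
    score_fn: Callable[[List[int]], float],
    keep: int,
) -> List[List[int]]:
    """Keep the `keep` highest-scoring chunks and return them in original order."""
    # Rank chunk indices by score, highest first (ties keep earlier chunks first).
    ranked = sorted(range(len(chunks)), key=lambda i: score_fn(chunks[i]), reverse=True)
    # Restore document order among the survivors.
    kept = sorted(ranked[:keep])
    return [chunks[i] for i in kept]


def overlap_score(query: Sequence[int]) -> Callable[[List[int]], float]:
    """Hypothetical relevance score: number of chunk tokens that appear in the query.

    This is an assumption made for illustration only; the excerpt does not
    describe the criterion ParallelComp uses to decide which chunks are redundant.
    """
    query_set = set(query)

    def score(chunk: List[int]) -> float:
        return float(sum(1 for t in chunk if t in query_set))

    return score


if __name__ == "__main__":
    # A 128K-token input split into 8K-token chunks gives 16 chunks.
    long_input = list(range(128 * 1024))
    print(len(split_into_chunks(long_input, 8 * 1024)))

    # Toy example: 20 tokens, chunks of 4, keep the 2 chunks most related to the query.
    chunks = split_into_chunks(list(range(20)), 4)
    print(evict_chunks(chunks, overlap_score([5, 6, 17]), keep=2))
```

Returning the surviving chunks in their original order is a design choice of this sketch, made so that the compressed context still reads in document order.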
Taxonomy
Topics: Heat Transfer and Optimization · Embedded Systems and FPGA Design · Astronomical Observations and Instrumentation
Methods: Softmax · Attention Is All You Need
