StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
Zhirui Chen, Peiyang Liu, Ling Shao

TL;DR
StructKV is a structure-aware KV cache compression framework for long-context inference in large language models. It preserves tokens that act as global information hubs across network depth, targeting context windows that exceed one million tokens.
Contribution
StructKV introduces a structure-aware compression method that combines global attention analysis across layers, dynamic layer selection, and separation of computation and memory to better preserve long-range dependencies.
Findings
Outperforms existing methods on LongBench and RULER benchmarks.
Effectively preserves long-range dependencies and robustness.
Reduces memory and bandwidth bottlenecks in long-context inference.
Abstract
As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the specific layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify…
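The contrast the abstract draws between a single-layer saliency snapshot and a depth-aggregated score can be illustrated with a small sketch. This is not the authors' implementation: the truncated abstract does not give the exact definition, so the code assumes that a token's in-degree is the total attention it receives in a layer (a column sum of the attention matrix) and that layers are combined by an unweighted mean. The function names `in_degree`, `global_in_degree`, and `top_k` are hypothetical, and the toy matrices ignore causal masking and multiple heads.

```python
from typing import List

# attn[q][k] is the attention weight that query token q places on key token k.
Matrix = List[List[float]]


def in_degree(attn: Matrix) -> List[float]:
    """Total attention each key token receives in one layer (column sums)."""
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) for k in range(n_keys)]


def global_in_degree(layers: List[Matrix]) -> List[float]:
    """Unweighted mean of per-layer in-degree scores (an assumed aggregation)."""
    per_layer = [in_degree(attn) for attn in layers]
    n_keys = len(per_layer[0])
    return [sum(s[k] for s in per_layer) / len(per_layer) for k in range(n_keys)]


def top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, returned in positional order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: 3 tokens, 2 layers. Token 1 is a hub in layer_a but receives
# no attention in layer_b, so it looks dormant if only layer_b is inspected.
layer_a = [[0.1, 0.8, 0.1],
           [0.1, 0.8, 0.1],
           [0.1, 0.8, 0.1]]
layer_b = [[0.9, 0.0, 0.1],
           [0.9, 0.0, 0.1],
           [0.8, 0.0, 0.2]]

local_keep = top_k(in_degree(layer_b), 2)                    # snapshot at layer_b
global_keep = top_k(global_in_degree([layer_a, layer_b]), 2)  # across both layers

print("kept by single-layer snapshot:", local_keep)
print("kept by depth-aggregated score:", global_keep)
```

With a budget of two tokens, the snapshot taken at `layer_b` drops token 1, while the depth-aggregated score keeps it because of the attention it receives in `layer_a`. A real scoring rule would also need to handle heads, causal masking, and per-layer normalization, none of which the visible abstract specifies.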
