LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing

Dongfang Li; Zixuan Liu; Gang Lin; Baotian Hu; Min Zhang

arXiv:2603.08453·cs.LG·March 10, 2026

Open Access

TL;DR

LycheeCluster is a structure-aware, hierarchical KV cache indexing method that speeds up long-context inference in large language models by up to 3.6x end to end, with negligible performance degradation.

Contribution

The paper proposes a hierarchical KV cache management technique that combines boundary-aware chunking with a recursive index, replacing the fixed-size chunking and linear scanning used by existing retrieval-based methods.

Findings

01. Achieves up to 3.6x inference speedup
02. Incurs negligible performance degradation
03. Outperforms state-of-the-art KV cache methods

Abstract

The quadratic complexity of the attention mechanism and the substantial memory footprint of the Key-Value (KV) cache present severe computational and memory challenges for Large Language Models (LLMs) processing long contexts. Existing retrieval-based methods often compromise semantic integrity through fixed-size chunking and suffer from inefficient linear scanning. In this paper, we propose LycheeCluster, a novel method for efficient KV cache management. LycheeCluster preserves local semantic coherence via boundary-aware chunking and constructs a recursive hierarchical index rooted in the triangle inequality. This design transforms cache retrieval from a linear scan into a theoretically bounded, logarithmic-time pruning process, while a lazy update strategy supports efficient streaming generation. Experiments demonstrate that LycheeCluster achieves up to a 3.6x end-to-end inference…
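The pruning idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the relevance score, the exact bound, or the tree layout, so the sketch assumes Euclidean distance, a centroid-plus-radius summary per node, and contiguous fixed-size leaves standing in for chunks. All names (`Node`, `build`, `nearest`, `leaf_size`, `fanout`) are hypothetical. The triangle inequality gives, for any key k under a node with center c and radius r, dist(q, k) >= dist(q, c) - r, so a whole subtree can be skipped once that lower bound is no better than the best key found so far.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    center: list                                   # centroid of the node
    radius: float                                  # covers every key below the node
    children: list = field(default_factory=list)   # non-empty for internal nodes
    keys: list = field(default_factory=list)       # non-empty for leaves: (index, vector)


def make_leaf(items):
    """Summarize a contiguous run of (index, vector) keys."""
    dim = len(items[0][1])
    center = [sum(v[d] for _, v in items) / len(items) for d in range(dim)]
    radius = max(math.dist(center, v) for _, v in items)
    return Node(center, radius, keys=list(items))


def make_parent(children):
    """Summarize child nodes; the radius is itself a triangle-inequality bound."""
    dim = len(children[0].center)
    center = [sum(c.center[d] for c in children) / len(children) for d in range(dim)]
    # dist(center, k) <= dist(center, c.center) + c.radius for every key k under c
    radius = max(math.dist(center, c.center) + c.radius for c in children)
    return Node(center, radius, children=list(children))


def build(items, leaf_size=2, fanout=2):
    """Build the index bottom-up (assumed layout, not the paper's)."""
    level = [make_leaf(items[i:i + leaf_size]) for i in range(0, len(items), leaf_size)]
    while len(level) > 1:
        level = [make_parent(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


def nearest(root, query):
    """Return (index of the closest key, number of nodes actually visited)."""
    best_dist, best_index = math.inf, None
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        # Lower bound on the distance from the query to any key under this node.
        if math.dist(query, node.center) - node.radius >= best_dist:
            continue  # nothing below can beat the current best: prune the subtree
        visited += 1
        if node.keys:
            for index, vector in node.keys:
                d = math.dist(query, vector)
                if d < best_dist:
                    best_dist, best_index = d, index
        else:
            # Push the farthest child first so the closest child is popped first.
            stack.extend(sorted(node.children,
                                key=lambda c: math.dist(query, c.center),
                                reverse=True))
    return best_index, visited


# Toy cache: two well-separated groups of 2-D "keys".
cache = [(0, [0, 0]), (1, [0, 1]), (2, [1, 0]), (3, [1, 1]),
         (4, [10, 10]), (5, [10, 11]), (6, [11, 10]), (7, [11, 11])]
root = build(cache)
best_index, visited = nearest(root, [9.5, 10.0])
print(best_index, visited)
```

On this toy cache the search touches only part of the seven-node tree, because the subtree around the origin is rejected by its bound without inspecting any of its keys. The sketch does not by itself guarantee logarithmic time; that depends on how well the clusters separate, which is the property the paper's bounded pruning argument addresses.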

Taxonomy

Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Big Data and Digital Economy