Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li, Siran Yang, Yunlong Xu, Jiaheng Liu, Yongchi Zhao, Jiamang Wang, Yuchi Xu, Wenbo Su, and Bo Zheng

TL;DR
This paper introduces FusedKV and FusedKV-Lite, two methods for cross-layer KV cache fusion in transformers that reduce KV cache memory while maintaining or improving performance on large language models.
Contribution
It proposes learnable cross-layer KV cache fusion techniques that enhance memory efficiency and performance in transformer decoders, addressing KV cache bottlenecks at long sequence lengths.
Findings
FusedKV reduces cache memory by 50% while lowering perplexity.
FusedKV-Lite further decreases I/O overhead at the cost of a slight perplexity increase.
Experiments on models from 332M to 4B parameters validate effectiveness.
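To make the 50% figure concrete, here is a rough back-of-the-envelope calculation. It assumes the saving comes from storing keys and values for only half of the layers, which this summary does not spell out. The function name `kv_cache_bytes` and the model configuration (32 layers, 8 KV heads, head dimension 128, 32,768 tokens, 2-byte elements) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(cached_layers, kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to hold keys and values for `cached_layers` layers."""
    # The leading factor 2 counts one key tensor and one value tensor per layer.
    return 2 * cached_layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical configuration, not taken from the paper.
full = kv_cache_bytes(cached_layers=32, kv_heads=8, head_dim=128, seq_len=32768)

# Assumption: only half of the layers keep their own cache.
half = kv_cache_bytes(cached_layers=16, kv_heads=8, head_dim=128, seq_len=32768)

gib = 1024 ** 3
print(f"all layers cached:  {full / gib:.1f} GiB")
print(f"half layers cached: {half / gib:.1f} GiB")
print(f"ratio: {half / full:.2f}")
```

Because cache size is linear in the number of cached layers, the ratio stays at 0.5 for any sequence length or batch size under this assumption.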
Abstract
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values of the top layers. Our preliminary analysis reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To…
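The claim that fusion can operate on post-RoPE keys can be checked numerically with a small sketch. This is not the paper's implementation: the exact fusion parameterization is not given in this excerpt, so the example assumes the simplest case, one scalar weight per source layer, and the helper names `rope` and `fuse` are hypothetical. Because a rotary embedding is a position-dependent rotation and rotations are linear, a weighted sum of already-rotated keys equals the rotation of the weighted sum.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim), dim even."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per channel pair
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1[i], x2[i]) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def fuse(bottom, middle, w_bottom, w_middle):
    """Weighted sum of two cached tensors; the weights stand in for learned parameters."""
    return w_bottom * bottom + w_middle * middle


rng = np.random.default_rng(0)
seq, dim = 6, 8
positions = np.arange(seq, dtype=np.float64)
k_bottom = rng.standard_normal((seq, dim))   # stand-in for bottom-layer keys
k_middle = rng.standard_normal((seq, dim))   # stand-in for middle-layer keys
w_b, w_m = 0.3, 0.7                          # hypothetical scalar fusion weights

# Fuse keys that were already rotated (what a cache of post-RoPE keys would hold).
fused_post = fuse(rope(k_bottom, positions), rope(k_middle, positions), w_b, w_m)
# Fuse first, then rotate once (what re-applying RoPE would compute).
fused_pre = rope(fuse(k_bottom, k_middle, w_b, w_m), positions)

max_gap = float(np.max(np.abs(fused_post - fused_pre)))
print(f"largest difference between the two orders: {max_gap:.2e}")
```

The two orders agree up to floating-point error, so no second rotary pass is needed in this scalar-weight case. With per-channel weights the same identity holds only when both channels of a rotary pair share one weight; whether the paper imposes such a constraint is not stated here.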
Peer Reviews
Decision·ICLR 2026 Poster
- The asymmetric key-value sharing principle is a clear, well-motivated insight grounded in empirical analysis (Figure 2), distinguishing this work from prior indiscriminate cross-layer approaches.
- Solid experimental validation across multiple model sizes (332M to 4B) with consistent gains over baselines. Validation loss curves (Figure 6) demonstrate stability across training.
- Practical design: operating on post-RoPE keys elegantly preserves positional information without computational overh

- Missing fundamental architectural comparison: This paper does not compare against MLA, which reduces cache dimensionality by design rather than reconstructing across layers. For practitioners optimizing cache size, adopting MLA is a more principled solution than reconstructing caches across standard multi-head attention layers. Direct comparison (accuracy, memory, inference speed) would clarify when cross-layer fusion is preferable to architectural redesign -- and particularly the potential of

- Drawing insights from preliminary experiment (dense fusion) seems a good research direction to build up their own methods.
- Measure various metrics to compare methods.
- Validation at scale (larger model sizes or larger training token numbers) seems good.

- I feel like FusedKV is not a good approach due to its lower throughput, which comes from its higher KV IO, and marginal quality differences over FusedKV-Lite.
- Proposed method is too heuristic. Although I personally like heuristic, simple yet effective method, I'm not sure about the generalizability of this proposed method. Using the first and middle layers as the source layer can be not optimal for other architectures (or with other modality). It seems like well-tailored, specific sharing st

* The proposed method reduces KV cache memory usage by 50% without any performance drop relative to the baseline.
* The observation of KV cache asymmetry appears to be novel within the compression literature.

* The overall approach lacks conceptual novelty. The idea of cache reuse has been explored in several prior works [1, 2] in a post-training setting, where cache similarity was used as the criterion for reuse. Approaches based on linear predictors [3] have achieved 4x compression with no performance drop and up to 8x compression with only minor degradation.
* The idea of a weighted fusion of previous keys and values also appeared in [4] in the context of improving gradient flow. It is possible t
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Network Packet Processing and Optimization
