Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

Hongzhan Lin; Zhiqi Bai; Xinmiao Zhang; Sen Yang; Xiang Li; Siran Yang; Yunlong Xu; Jiaheng Liu; Yongchi Zhao; Jiamang Wang; Yuchi Xu; Wenbo Su; and Bo Zheng

arXiv:2512.03870·cs.CL·February 20, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces FusedKV and FusedKV-Lite, two methods for cross-layer KV cache fusion in transformers that cut KV cache memory by 50% while maintaining or improving the performance of large language models.

Contribution

It proposes learnable cross-layer KV cache fusion techniques that enhance memory efficiency and performance in transformer decoders, addressing KV cache bottlenecks at long sequence lengths.

Findings

01

FusedKV reduces cache memory by 50% while lowering perplexity.

02

FusedKV-Lite further decreases I/O overhead at the cost of a slight perplexity increase.

03

Experiments on models from 332M to 4B parameters validate effectiveness.
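
As a rough illustration of the 50% figure, the sketch below estimates KV cache size under the assumption that only half of the layers store their own keys and values while the remaining layers reconstruct theirs. The model dimensions and the function name `kv_cache_bytes` are hypothetical, and the formula is the usual back-of-the-envelope estimate, not a measurement from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes; the factor 2 covers keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical configuration (not taken from the paper), 16-bit cache entries.
n_layers, n_kv_heads, head_dim, seq_len = 32, 8, 128, 32768

baseline = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len)
# Assumption: only half of the layers keep a cache; the rest are reconstructed.
fused = kv_cache_bytes(n_layers // 2, n_kv_heads, head_dim, seq_len)

print(baseline / 2**30, fused / 2**30)  # prints 4.0 2.0 (GiB)
```

The saving scales linearly with sequence length and batch size, which is why cache sharing matters most in long-context settings.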

Abstract

Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of the keys and values of the top layers. Our preliminary analysis reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To…
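
The claim that fusing post-RoPE keys preserves positional information can be illustrated with a minimal sketch. It assumes scalar fusion weights and a single 2-D rotary pair; the function names, the weights `w_bottom` and `w_middle`, and all numeric values are hypothetical and are not the paper's formulation. Because a RoPE rotation is linear, a scalar-weighted sum of post-RoPE keys at the same position equals the rotation of the fused pre-RoPE key, so under these assumptions the rotary embedding does not need to be re-applied.

```python
import math

def rope_rotate(vec, position, theta=0.5):
    """Rotate one 2-D (x, y) pair by the angle position * theta, as RoPE does per pair."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def fuse(a, b, w_a, w_b):
    """Scalar-weighted sum of two vectors (a stand-in for the learnable fusion)."""
    return tuple(w_a * ai + w_b * bi for ai, bi in zip(a, b))

# Hypothetical pre-RoPE keys for one token, from a bottom and a middle layer.
k_bottom, k_middle = (1.0, 2.0), (-0.5, 0.25)
w_bottom, w_middle = 0.7, 0.3  # illustrative values; these would be learned in training
position = 5

# Path A: fuse the cached post-RoPE keys directly.
fused_post = fuse(rope_rotate(k_bottom, position),
                  rope_rotate(k_middle, position),
                  w_bottom, w_middle)
# Path B: fuse the pre-RoPE keys, then apply RoPE once.
fused_then_rotated = rope_rotate(fuse(k_bottom, k_middle, w_bottom, w_middle), position)

max_diff = max(abs(p - q) for p, q in zip(fused_post, fused_then_rotated))
print(max_diff < 1e-12)  # prints True
```

Note that this equivalence relies on the same weight being applied to both components of a rotary pair; weights that differ within a pair would not commute with the rotation in general.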

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The asymmetric key-value sharing principle is a clear, well-motivated insight grounded in empirical analysis (Figure 2), distinguishing this work from prior indiscriminate cross-layer approaches.
- Solid experimental validation across multiple model sizes (332M to 4B) with consistent gains over baselines. Validation loss curves (Figure 6) demonstrate stability across training.
- Practical design: operating on post-RoPE keys elegantly preserves positional information without computational overh

Weaknesses

- Missing fundamental architectural comparison: This paper does not compare against MLA, which reduces cache dimensionality by design rather than reconstructing across layers. For practitioners optimizing cache size, adopting MLA is a more principled solution than reconstructing caches across standard multi-head attention layers. Direct comparison (accuracy, memory, inference speed) would clarify when cross-layer fusion is preferable to architectural redesign -- and particularly the potential of

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Drawing insights from preliminary experiment (dense fusion) seems a good research direction to build up their own methods.
- Measure various metrics to compare methods.
- Validation at scale (larger model sizes or larger training token numbers) seems good.

Weaknesses

- I feel like FusedKV is not a good approach due to its lower throughput, which comes from its higher KV IO, and marginal quality differences over FusedKV-Lite.
- Proposed method is too heuristic. Although I personally like heuristic, simple yet effective method, I'm not sure about the generalizability of this proposed method. Using the first and middle layers as the source layer can be not optimal for other architectures (or with other modality). It seems like well-tailored, specific sharing st

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* The proposed method reduces KV cache memory usage by 50% without any performance drop relative to the baseline.
* The observation of KV cache asymmetry appears to be novel within the compression literature.

Weaknesses

* The overall approach lacks conceptual novelty. The idea of cache reuse has been explored in several prior works [1, 2] in a post-training setting, where cache similarity was used as the criterion for reuse. Approaches based on linear predictors [3] have achieved 4x compression with no performance drop and up to 8x compression with only minor degradation.
* The idea of a weighted fusion of previous keys and values also appeared in [4] in the context of improving gradient flow. It is possible t

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Network Packet Processing and Optimization