YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

You Wu; Ziheng Chen; Yizhen Zhang; Haoyi Wu; Chengting Yu; Yuchi Xu; Wenbo Su; Bo Zheng; Kewei Tu

arXiv:2604.13556·cs.CL·April 16, 2026

TL;DR

YOCO++ adds weighted KV residual connections to YOCO, a cross-layer KV compression method for LLM inference, and achieves state-of-the-art performance among such methods at a 50% KV cache compression rate.

Contribution

The paper introduces YOCO++, which adds a weighted residual connection between the KVs of each bottom-half layer and the bottom layer of YOCO, increasing model capacity while keeping YOCO's training and inference efficiency.

Findings

01. YOCO++ outperforms the standard Transformer at a 50% KV cache compression rate.

02. YOCO++ increases model capacity over YOCO while keeping the same training and inference efficiency.

03. YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods.

Abstract

Cross-layer key-value (KV) compression has been found to be effective for efficient inference of large language models (LLMs). Although such methods reduce the memory consumption of the KV cache, they usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.
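The sharing pattern and the residual connection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: layers are 0-indexed, the "middle layer" is taken to be the last bottom-half layer, each bottom-half layer caches its own KVs, and the residual is a single scalar weight applied to the bottom layer's KVs and added element-wise. The function names (`kv_source_layer`, `cached_layer_fraction`, `residual_kv`) are hypothetical.

```python
from typing import List

Vector = List[float]


def kv_source_layer(layer: int, num_layers: int) -> int:
    """Return the layer whose KVs `layer` attends to (0-indexed).

    Assumption: layers 0 .. num_layers//2 - 1 are the bottom half and
    keep their own KVs; every top-half layer reuses the KVs of the
    middle layer, taken here to be the last bottom-half layer.
    """
    middle = num_layers // 2 - 1
    return layer if layer <= middle else middle


def cached_layer_fraction(num_layers: int) -> float:
    """Fraction of layers whose KVs must be stored in the cache."""
    sources = {kv_source_layer(l, num_layers) for l in range(num_layers)}
    return len(sources) / num_layers


def residual_kv(own: Vector, bottom: Vector, weight: float) -> Vector:
    """Weighted residual between a layer's KV vector and the bottom layer's.

    Assumption: the combination is own + weight * bottom with one scalar
    weight; the paper may parameterize the weighting differently.
    """
    return [o + weight * b for o, b in zip(own, bottom)]


if __name__ == "__main__":
    # With 8 layers, only the 4 bottom-half layers hold KVs.
    print(cached_layer_fraction(8))
    # A toy key vector for one bottom-half layer, mixed with the bottom layer's.
    print(residual_kv([1.0, 2.0], [10.0, 20.0], weight=0.5))
```

Under these assumptions, half of the layers store KVs, which matches the 50% compression rate quoted in the abstract. The residual only combines KVs that the bottom half already computes and adds no cached tensors, which is consistent with the claim that efficiency is unchanged relative to YOCO.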
