How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

Rui Zhu; Weiheng Bai; Qiushi Wu; Yang Ren; Haixu Tang; Yuchu Liu

arXiv:2605.06850·cs.LG·May 11, 2026

TL;DR

This paper introduces Shadow Mask Distillation, a method for compressing the KV cache during RL post-training of LLMs. It reduces rollout memory usage while addressing the off-policy bias that compression induces.

Contribution

The paper proposes Shadow Mask Distillation, a technique that enables memory-efficient KV cache compression in RL without introducing significant bias.

Findings

01

KV cache compression reduces memory footprint during RL training.

02

Shadow Mask Distillation maintains training stability despite compression.

03

The method achieves efficient memory use with minimal impact on RL performance.

Abstract

Reinforcement Learning (RL) has emerged as a crucial paradigm for unlocking the advanced reasoning capabilities of Large Language Models (LLMs), encompassing frameworks like RLHF and RLAIF. Regardless of the specific optimization algorithm (e.g., PPO, GRPO, or Online DPO), online RL inherently requires an exploratory trajectory generation (rollout) phase. However, for long-context reasoning tasks, this rollout phase imposes a severe "memory wall" due to the exorbitant Key-Value (KV) cache footprint. While applying KV cache compression during rollouts mitigates this memory overhead, it induces a critical off-policy bias. Although modern KV compression is often nearly lossless during standard inference, even minuscule approximation errors are drastically amplified by the inherent instability of RL optimization. Specifically, the sampler generates responses under a sparse context,…
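To give a sense of scale for the "memory wall" the abstract describes, the sketch below estimates the KV cache footprint of a rollout batch using the standard per-token accounting (one key and one value vector per layer and per KV head). The model dimensions, batch size, and `keep_ratio` are hypothetical values chosen for illustration; they are not taken from the paper, and the token-eviction model of compression here is a generic assumption, not the paper's method.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a dense (uncompressed) cache.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


def compressed_kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                              keep_ratio, batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size when only a fraction of tokens is retained.

    Assumption: compression is modeled as keeping keep_ratio of the
    tokens in every layer, which is a simplification of real schemes.
    """
    kept_tokens = int(seq_len * keep_ratio)
    return kv_cache_bytes(num_layers, num_kv_heads, head_dim, kept_tokens,
                          batch_size, bytes_per_elem)


GIB = 2 ** 30

# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128,
# 32,768-token rollouts, 64 concurrent rollouts, fp16 storage.
dense = kv_cache_bytes(32, 8, 128, 32768, batch_size=64)
sparse = compressed_kv_cache_bytes(32, 8, 128, 32768, keep_ratio=0.25,
                                   batch_size=64)

print(f"dense cache:      {dense / GIB:.1f} GiB")   # 256.0 GiB
print(f"compressed cache: {sparse / GIB:.1f} GiB")  # 64.0 GiB
```

Under these assumed numbers a single 32K-token rollout already needs 4 GiB of cache, so the footprint grows quickly with rollout batch size. The estimate covers memory only; it says nothing about the off-policy bias that arises when the sampler sees the compressed context.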
