DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis

TL;DR
DualKV introduces a novel FlashAttention kernel variant that eliminates shared prompt token duplication during RL training, significantly improving efficiency and enabling larger batch sizes.
Contribution
The paper proposes DualKV, a kernel-level solution that reduces redundant computation in shared-prompt RL training, achieving substantial speedups without approximation.
Findings
Achieves up to 2.09x policy-update speedup on Qwen3-8B
Raises model FLOPs utilization (MFU) from 36% to 76% in large-scale RL training
Enables 3.82x policy-update speedup at 30B MoE scale
Abstract
Modern RL post-training methods such as GRPO and DAPO train on multiple response sequences sampled from a shared prompt, but standard FlashAttention replicates all prompt tokens once per response across both forward and backward passes, duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training, this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once, a property not yet exploited at the kernel level for training. We propose DualKV, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and…
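The invariance claim in the abstract can be checked numerically on a toy example. The sketch below is plain Python, not the paper's CUDA kernel, and it covers only the forward pass of a single attention head. It assumes the q/k/v rows are already projected and fills them with random numbers. All function names (`causal_attention`, `replicated_attention`, `shared_prompt_attention`, `compare`) are hypothetical and chosen for illustration. It compares a baseline that copies the prompt into every sequence against a variant that runs prompt attention once and lets each response attend to the shared prompt keys/values plus its own.

```python
import math
import random


def causal_attention(q, k, v, offset=0):
    """Single-head softmax attention; query row i sees key rows 0..offset+i."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        visible = offset + i + 1
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(visible)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v[0]))
        ])
    return out


def random_qkv(rng, n_tokens, dim):
    """Hypothetical stand-in for the projected q/k/v rows of n_tokens tokens."""
    def rows():
        return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
    return {"q": rows(), "k": rows(), "v": rows()}


def replicated_attention(prompt, responses):
    """Baseline: every sequence carries its own copy of the prompt tokens."""
    outputs = []
    for resp in responses:
        q = prompt["q"] + resp["q"]
        k = prompt["k"] + resp["k"]
        v = prompt["v"] + resp["v"]
        outputs.append(causal_attention(q, k, v))
    return outputs


def shared_prompt_attention(prompt, responses):
    """Prompt attention runs once; only response rows are queried per rollout."""
    n_prompt = len(prompt["q"])
    prompt_out = causal_attention(prompt["q"], prompt["k"], prompt["v"])
    response_outs = []
    for resp in responses:
        # Assumption: a real kernel could read the shared prompt K/V and the
        # per-response K/V from separate buffers; here lists are concatenated.
        k = prompt["k"] + resp["k"]
        v = prompt["v"] + resp["v"]
        response_outs.append(causal_attention(resp["q"], k, v, offset=n_prompt))
    return prompt_out, response_outs


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def compare(n_prompt=6, n_response=4, n_rollouts=3, dim=8, seed=0):
    """Return the worst output mismatch and the query-row counts of both schemes."""
    rng = random.Random(seed)
    prompt = random_qkv(rng, n_prompt, dim)
    responses = [random_qkv(rng, n_response, dim) for _ in range(n_rollouts)]
    baseline = replicated_attention(prompt, responses)
    prompt_out, response_outs = shared_prompt_attention(prompt, responses)
    worst = 0.0
    for full, resp_out in zip(baseline, response_outs):
        worst = max(worst, max_abs_diff(full[:n_prompt], prompt_out))
        worst = max(worst, max_abs_diff(full[n_prompt:], resp_out))
    query_rows = {
        "replicated": n_rollouts * (n_prompt + n_response),
        "shared": n_prompt + n_rollouts * n_response,
    }
    return worst, query_rows
```

Calling `compare()` with the defaults processes 30 query rows in the replicated baseline and 18 in the shared-prompt variant, while the two sets of outputs agree. The saving grows with the number of rollouts and with the prompt length, which is the regime the abstract targets. The backward pass and any kernel fusion are outside what this sketch shows.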
