DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

Jiading Gai; Shuai Zhang; Xiang Song; Bernie Wang; George Karypis

arXiv:2605.15422·cs.LG·May 18, 2026

TL;DR

DualKV is a FlashAttention kernel variant that eliminates the duplication of shared-prompt tokens during RL training, which speeds up policy updates (up to 2.09x on Qwen3-8B) and enables larger batch sizes.

Contribution

The paper proposes DualKV, a kernel-level solution that reduces redundant computation in shared-prompt RL training. Its speedups come without approximation.

Findings

01. Achieves up to 2.09x policy-update speedup on Qwen3-8B

02. Raises MFU from 36% to 76% in large-scale RL training

03. Enables 3.82x policy-update speedup at 30B MoE scale

Abstract

Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes, duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N \geq 16$, $P \geq 8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once, a property not yet exploited at the kernel level for training. We propose DualKV, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and…
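
The invariance claim in the abstract can be checked numerically for a single attention step. The following is a minimal NumPy sketch, not the authors' fused CUDA kernel: it compares attention over N replicated (prompt + response) sequences against a version that attends over the prompt once and lets each response read the shared prompt keys/values alongside its own. The function names (`causal_attention`, `replicated_attention`, `shared_prompt_attention`) and the single-head, unbatched layout are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis; -inf entries become 0.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v):
    """Single-head causal attention over one sequence; q, k, v are (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    return softmax(np.where(future, -np.inf, scores)) @ v

def replicated_attention(qp, kp, vp, qr, kr, vr):
    """Baseline: copy the prompt into each of the N sequences, as a standard layout would."""
    outs = []
    for n in range(qr.shape[0]):
        q = np.concatenate([qp, qr[n]])
        k = np.concatenate([kp, kr[n]])
        v = np.concatenate([vp, vr[n]])
        outs.append(causal_attention(q, k, v))
    return np.stack(outs)  # (N, P + R, d)

def shared_prompt_attention(qp, kp, vp, qr, kr, vr):
    """Prompt handled once; responses read shared prompt K/V plus their own K/V.

    Prompt tensors are (P, d); response tensors are (N, R, d).
    """
    P, d = qp.shape
    R = qr.shape[1]
    out_prompt = causal_attention(qp, kp, vp)  # computed once, not N times
    # Every prompt key precedes every response query, so no mask is needed here.
    s_prompt = np.einsum("nrd,pd->nrp", qr, kp) / np.sqrt(d)
    # Within a response, the usual causal mask applies.
    s_resp = np.einsum("nrd,nsd->nrs", qr, kr) / np.sqrt(d)
    future = np.triu(np.ones((R, R), dtype=bool), k=1)
    s_resp = np.where(future, -np.inf, s_resp)
    # One softmax across both key sources, then a weighted sum from each value source.
    probs = softmax(np.concatenate([s_prompt, s_resp], axis=-1))
    out_resp = probs[..., :P] @ vp + np.einsum("nrs,nsd->nrd", probs[..., P:], vr)
    return out_prompt, out_resp

rng = np.random.default_rng(0)
N, P, R, d = 4, 6, 3, 8  # toy sizes, chosen only for the check
qp, kp, vp = (rng.normal(size=(P, d)) for _ in range(3))
qr, kr, vr = (rng.normal(size=(N, R, d)) for _ in range(3))

full = replicated_attention(qp, kp, vp, qr, kr, vr)
out_prompt, out_resp = shared_prompt_attention(qp, kp, vp, qr, kr, vr)

# Prompt outputs are identical in every replicated sequence and match the single pass.
assert np.allclose(full[:, :P], out_prompt)
# Response outputs match as well.
assert np.allclose(full[:, P:], out_resp)
```

In this sketch the replicated layout processes N(P + R) token positions, while the shared layout processes P + NR. The check covers only the forward attention step; the per-token operations (norms, projections, MLP) are position-wise, so identical prompt inputs give identical prompt outputs there too.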
