Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

Cheng Wang; Qin Liu; Wenxuan Zhou; Muhao Chen

arXiv:2605.11538·cs.CL·May 13, 2026

TL;DR

This paper introduces a covariance-aware, Gaussian-kernel reweighting method to improve Group Relative Policy Optimization (GRPO) for large language models, enhancing stability and reasoning performance.

Contribution

It proposes a hyperparameter-free, covariance-weighted optimization technique that dynamically down-weights extreme token updates to balance exploration and exploitation.

Findings

01 Improves downstream reasoning benchmark performance

02 Stabilizes entropy during training

03 Reduces instability caused by the exploration-exploitation trade-off

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to balance the trade-off between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by the exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and…
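The abstract does not give the exact weighting formula, so the following is only a minimal sketch of how a covariance-aware Gaussian-kernel reweighting could look, not the authors' implementation. It assumes that each token's covariance contribution is the product of its centered log-probability and its centered advantage, and that the kernel is centered on the batch mean of those contributions with a bandwidth equal to their batch standard deviation, so no tunable hyperparameter is introduced. The function names and the `eps` constant (numerical stability only) are hypothetical.

```python
import math


def covariance_terms(logprobs, advantages):
    """Per-token contribution to the logprob/advantage covariance (assumed form)."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (a - mean_adv) for lp, a in zip(logprobs, advantages)]


def gaussian_kernel_weights(cov_terms, eps=1e-8):
    """Gaussian-kernel weights in (0, 1]; tokens far from the batch mean get less weight.

    Assumption: center and bandwidth come from batch statistics, so the only
    constant is eps, which just guards against division by zero.
    """
    n = len(cov_terms)
    mu = sum(cov_terms) / n
    var = sum((c - mu) ** 2 for c in cov_terms) / n
    return [math.exp(-((c - mu) ** 2) / (2.0 * (var + eps))) for c in cov_terms]


def reweight_advantages(logprobs, advantages):
    """Scale each token-level advantage by its kernel weight (hypothetical helper)."""
    weights = gaussian_kernel_weights(covariance_terms(logprobs, advantages))
    return [w * a for w, a in zip(weights, advantages)]


# Toy batch: the last token has a very low log-probability and a large advantage,
# which makes its covariance term an outlier.
logprobs = [-0.1, -0.5, -1.0, -6.0]
advantages = [0.5, -0.3, 0.2, 2.0]
weights = gaussian_kernel_weights(covariance_terms(logprobs, advantages))
scaled = reweight_advantages(logprobs, advantages)
```

In this toy batch the outlier token receives the smallest weight, while tokens with typical covariance contributions keep most of their advantage. In a real GRPO training loop the weights would presumably multiply the token-level advantages inside the policy-gradient loss, and would be treated as constants with respect to the gradient; that detail is also an assumption here.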
