SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization

Minghan Chen; Guikun Chen; Wenguan Wang; Yi Yang

arXiv:2505.12346·cs.AI·May 20, 2025

Open Access 3 Reviews

TL;DR

SEED-GRPO introduces a novel uncertainty-aware policy optimization method for large language models by measuring semantic entropy of answers, enabling dynamic policy updates and improving performance on reasoning benchmarks.

Contribution

The paper proposes SEED-GRPO, which incorporates semantic entropy to modulate policy updates based on input prompt uncertainty, enhancing LLM training effectiveness.

Findings

01

Achieves state-of-the-art average accuracy on five mathematical reasoning benchmarks.

02

Effectively adjusts policy updates based on question uncertainty.

03

Improves model confidence and robustness in reasoning tasks.

Abstract

Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions): some lead to consistent, semantically similar answers, while others yield diverse or contradictory outputs. This variation reflects the LLM's uncertainty about the input prompt, a signal of how confidently the model understands a given problem. However, vanilla Group Relative Policy Optimization (GRPO) treats all prompts equally during policy updates, ignoring this important information about the model's knowledge boundaries. To address this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which explicitly measures LLMs' uncertainty about input prompts via semantic entropy. Semantic entropy measures the diversity of meaning across multiple answers generated for a prompt, and SEED-GRPO uses it to modulate the magnitude of policy updates. This uncertainty-aware training mechanism enables…
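The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation; several pieces are assumptions made for illustration: answers are grouped by exact match of their stripped final-answer strings (a stand-in for a real semantic-equivalence check), entropy is normalized by its maximum `log(n)`, and the advantage is scaled linearly so that higher entropy yields a smaller update. The function names and the `alpha` parameter are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(answers, normalize=str.strip):
    """Entropy over meaning clusters of the sampled answers.

    Assumption: two answers share a meaning iff their normalized strings match.
    """
    clusters = Counter(normalize(a) for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def uncertainty_scaled_advantages(answers, rewards, alpha=1.0):
    """Shrink advantages for prompts whose answers disagree in meaning.

    Assumption: linear scaling 1 - alpha * SE / SE_max, clipped at zero.
    """
    n = len(answers)
    max_entropy = math.log(n) if n > 1 else 1.0
    scale = max(1.0 - alpha * semantic_entropy(answers) / max_entropy, 0.0)
    return [scale * a for a in group_relative_advantages(rewards)]


if __name__ == "__main__":
    rewards = [1.0, 1.0, 0.0, 0.0]
    # Two meaning clusters among four samples: moderate uncertainty.
    print(uncertainty_scaled_advantages(["4", "4", "5", "5"], rewards))
    # Four distinct answers: maximal uncertainty, so the update is suppressed.
    print(uncertainty_scaled_advantages(["1", "2", "3", "4"], rewards))
```

Under these assumptions, a prompt whose sampled answers all agree keeps its full group-relative advantages, while a prompt whose answers are all different contributes almost nothing to the update. Because the entropy is computed from the same samples GRPO already draws for a prompt, the sketch needs no additional generations.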

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The idea is clear and intuitive. Using semantic entropy to control update strength is a natural way to make GRPO uncertainty-aware, and it seems to make sense.
2. The method is simple and clean. It incurs no extra sampling cost and limited training cost, making it easy to integrate.
3. Experiment results. Experiments are strong and consistent across five math reasoning benchmarks; improvements over Dr.GRPO and other large baselines are convincing. Ablations are systematic and well presented, showing st

Weaknesses

1. Lacks details. The semantic grouping procedure is not clear. Will it affect the final performance?
2. The entropy calculation. The semantic entropy is computed only from final answers. Would it be better to also consider the entropy of the thinking process?
3. More benchmark results. Is it possible to extend the results to more benchmarks, not limited to math?

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Quality: SEED-GRPO achieves state-of-the-art average performance across five mathematical reasoning benchmarks with the Qwen2.5-Math backbone model. Over 15 baselines are included for comparison.
2. Clarity: The paper is well written and easy to follow.

Weaknesses

1. Significance: The paper focuses exclusively on mathematical reasoning, where uncertainty and correctness are easy to define. It remains unclear how semantic entropy performs in open-ended or multimodal domains, where "semantic clusters" may not be easily defined. This makes it challenging to extend the algorithm to more general scenarios.
2. Novelty: Although the paper claims it is the first paper to incorporate uncertainty into GRPO, the actual implementation is essentially reweighting

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The strengths of this paper are as follows:
1. This paper proposes SEED-GRPO, which mitigates the issue of training on potentially harmful prompts in a simple way.
2. The experiments are conducted on three models and tested on five datasets, and the results look promising.
3. The paper is clearly written and easy to follow.

Weaknesses

The weaknesses of this paper are listed as follows:
1. The configuration of the baselines is unclear. It looks like the paper simply gathers a number of off-the-shelf models trained with different algorithms as the baselines. However, it is unclear whether these models are trained from the same base model and on the same dataset. Given this, it is hard to conclude whether SEED-GRPO really outperforms the baselines.
2. In the experiment setup, the maximum output is set to 3000 tokens. Howeve

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications