SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

Zhi Zheng; Yu Gu; Wei Liu; Yee Whye Teh; Wee Sun Lee

arXiv:2511.06411 · cs.AI · January 30, 2026

TL;DR

This paper introduces SofT-GRPO, a reinforcement learning algorithm for training soft-thinking reasoning in large language models. It relies on Gumbel-Softmax sampling and the reparameterization trick, and it improves on discrete-token GRPO.
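
As a rough illustration of the Gumbel-Softmax sampling named above, the sketch below perturbs log-probabilities with Gumbel noise and applies a temperature-scaled softmax. It is a generic textbook version, not the paper's exact procedure. The function names, the temperature `tau`, and the idea of feeding the model a probability-weighted mix of token embeddings are assumptions made for this example.

```python
import math
import random


def gumbel_softmax(probs, tau=1.0, rng=random):
    """Draw one relaxed (soft) sample from a categorical distribution.

    probs: token probabilities, assumed strictly positive and summing to 1.
    tau:   temperature (hypothetical default); smaller values push the
           sample toward a one-hot vector.
    """
    noise = []
    for _ in probs:
        # Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)).
        # u is clamped away from 0 and 1 to keep both logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise.append(-math.log(-math.log(u)))
    # Perturb the log-probabilities, then take a temperature-scaled softmax.
    logits = [(math.log(p) + g) / tau for p, g in zip(probs, noise)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(weights, embeddings):
    """Mix token embeddings with the sampled weights (assumed soft-token form)."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
```

Because the noise is drawn independently of the probabilities, the sample is a deterministic function of `probs` once the noise is fixed, which is what makes reparameterized updates possible in principle.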

Contribution

The paper proposes SofT-GRPO, a policy optimization method for training soft-thinking LLMs. It addresses the two obstacles that held back earlier attempts: injecting stochasticity into soft-thinking tokens and updating the soft-thinking policy accordingly.

Findings

01. SofT-GRPO slightly outperforms discrete-token GRPO on Pass@1 accuracy.

02. SofT-GRPO significantly improves Pass@32 accuracy.

03. Experiments were conducted on LLMs from 1.5B to 7B parameters.
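
For readers unfamiliar with the Pass@1 and Pass@32 metrics above, here is a minimal sketch of a commonly used unbiased Pass@k estimator. This page does not say which estimator the authors used, so treat the formula and the function name `pass_at_k` as assumptions for illustration.

```python
import math


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct.

    n: total samples generated for a problem
    c: number of those samples that are correct
    k: sample budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    # 1 minus the probability that a random size-k subset is all incorrect.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With `k = 1` the estimate reduces to the plain fraction of correct samples, `c / n`.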

Abstract

The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the…
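
The "group relative" part of GRPO mentioned in the abstract can be sketched as follows: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. This is the standard discrete-token GRPO advantage, shown only for orientation; the function name and the `eps` stabilizer are assumptions, and nothing here is specific to SofT-GRPO.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    rewards: one scalar reward per sampled response
    eps:     small constant (hypothetical value) that avoids division by
             zero when every response receives the same reward
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above the group mean get a positive advantage and are reinforced; those below get a negative one. A group with identical rewards yields all-zero advantages and therefore no learning signal.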

Code & Models

Models

🤗 zz1358m/SofT-GRPO-master · model · ♡ 8

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science