SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
Zhi Zheng, Yu Gu, Wei Liu, Yee Whye Teh, Wee Sun Lee

TL;DR
This paper introduces SofT-GRPO, a reinforcement learning algorithm for training soft-thinking large language models. It relies on the Gumbel-Softmax technique and the reparameterization trick, and it improves on discrete-token GRPO.
Contribution
The paper proposes SofT-GRPO, a policy optimization method for training soft-thinking LLMs. It addresses two difficulties that held back earlier attempts: injecting stochasticity into soft-thinking tokens and updating the soft-thinking policy accordingly.
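The Gumbel-Softmax step behind this contribution can be sketched as follows. This is a generic, minimal illustration of the technique and not the paper's implementation: the temperature `tau`, the helper names `gumbel_softmax` and `soft_token`, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Sample a point on the probability simplex via the Gumbel-Softmax trick."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise is g = -log(-log(u)); add it to the logit.
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_token(weights, embeddings):
    """Mix token embeddings with the sampled weights (a convex combination)."""
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]

# Toy vocabulary of three tokens with hypothetical 2-d embeddings.
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
mixed = soft_token(weights, embeddings)
```

Because the sampled weights are non-negative and sum to one, the mixed vector is a convex combination of existing token embeddings. The randomness enters only through the noise term, which is what makes the sample a differentiable function of the logits in an autodiff framework.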
Findings
SofT-GRPO slightly outperforms discrete-token GRPO on Pass@1 accuracy.
SofT-GRPO significantly improves Pass@32 accuracy.
Experiments were conducted on LLMs ranging from 1.5B to 7B parameters.
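For readers unfamiliar with the Pass@1 and Pass@32 metrics in these findings, the sketch below shows one common way to estimate Pass@k. The summary does not say how the paper computes it, so the use of the standard unbiased combinatorial estimator here is an assumption, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled answers, c of which are correct.

    Returns the probability that at least one of k answers drawn
    without replacement from the n samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 32 samples for one problem, 4 of them correct.
p1 = pass_at_k(32, 4, 1)    # 0.125
p32 = pass_at_k(32, 4, 32)  # 1.0
```

Pass@1 rewards being right on a typical attempt, while Pass@32 rewards having at least one correct answer among many attempts, so a gain on Pass@32 suggests more diverse successful reasoning paths.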
Abstract
The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the…
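The abstract refers to group relative policy optimization (GRPO) as the baseline policy optimization algorithm. As a rough illustration of its group-relative advantage, the sketch below normalizes each sampled response's reward by the mean and standard deviation of its group. The function name, the choice of population standard deviation, and the `eps` stabilizer are assumptions for this example and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1 if correct and 0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, with no separate value network required. A group whose rewards are all equal yields zero advantages and therefore contributes no learning signal.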
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science
