DGRO: Enhancing LLM Reasoning via Exploration-Exploitation Control and Reward Variance Management

Xuerui Su; Liya Guo; Yue Wang; Yi Zhu; Zhiming Ma; Zun Wang; Yuting Liu

arXiv:2505.12951·cs.LG·May 20, 2025

Open Access

TL;DR

This paper introduces DGRO, a novel RL algorithm for LLM reasoning that decouples exploration and exploitation controls and manages reward variance, leading to improved reasoning performance and faster convergence.

Contribution

DGRO is a new RL method that independently tunes exploration and exploitation, and incorporates reward variance management, enhancing LLM reasoning capabilities.

Findings

01. Achieves 96.9% accuracy on Logic dataset

02. Outperforms existing methods on mathematical benchmarks

03. Demonstrates improved convergence speed and generalization

Abstract

Inference scaling further accelerates Large Language Models (LLMs) toward Artificial General Intelligence (AGI), with large-scale Reinforcement Learning (RL) used to unleash long Chain-of-Thought reasoning. Most contemporary reasoning approaches rely on handcrafted rule-based reward functions. However, the trade-offs between exploration and exploitation in RL algorithms involve multiple complex considerations, and the theoretical and empirical impacts of manually designed reward functions remain insufficiently explored. In this paper, we propose Decoupled Group Reward Optimization (DGRO), a general RL algorithm for LLM reasoning. On the one hand, DGRO decouples the traditional regularization coefficient into two independent hyperparameters: one scales the policy gradient term, and the other regulates the distance from the sampling policy. This decoupling not only enables precise control…
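The decoupling described in the abstract can be illustrated with a small sketch. This is not the paper's objective, which the truncated abstract does not state. It is a generic policy-gradient loss written under assumptions: `pg_scale` and `dist_scale` are hypothetical names for the two independent coefficients, the distance term is a simple sample estimate of the KL divergence from the sampling policy, and rewards are standardized within a sampled group.

```python
import math
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Standardize rewards within one sampled group (an assumed normalization)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # All rewards equal: the group carries no learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def decoupled_loss(
    logp_new: Sequence[float],   # log-probs of sampled responses under the current policy
    logp_old: Sequence[float],   # log-probs of the same responses under the sampling policy
    rewards: Sequence[float],
    pg_scale: float,             # hypothetical: scales the policy-gradient term
    dist_scale: float,           # hypothetical: scales the distance to the sampling policy
) -> float:
    """Illustrative loss with two independent coefficients (not the paper's formula)."""
    adv = group_advantages(rewards)
    n = len(rewards)
    # Importance-weighted policy-gradient surrogate.
    pg_term = sum(
        math.exp(ln - lo) * a for ln, lo, a in zip(logp_new, logp_old, adv)
    ) / n
    # Sample estimate of KL(sampling policy || current policy), assumed as the distance.
    dist_term = sum(lo - ln for ln, lo in zip(logp_new, logp_old)) / n
    return -pg_scale * pg_term + dist_scale * dist_term
```

In this sketch, changing `dist_scale` leaves the policy-gradient term untouched and changing `pg_scale` leaves the distance term untouched, which is the sense in which a single regularization coefficient is split into two.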

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings