Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning

Ang Li; Zhihang Yuan; Yang Zhang; Shouda Liu; Yisen Wang

arXiv:2509.00125·cs.AI·September 3, 2025


TL;DR

This paper introduces DACE, a reinforcement learning algorithm for LLMs that uses self-assessed certainty to adaptively balance exploration and exploitation, improving reasoning performance on challenging benchmarks.

Contribution

DACE leverages LLMs' self-certainty as a dynamic signal to guide exploration, addressing the limitations of outcome-based rewards in reinforcement learning.

Findings

01 DACE outperforms strong baselines on mathematical reasoning benchmarks.

02 Models trained with DACE achieve higher accuracy and robustness.

03 Adaptive exploration improves learning efficiency without sacrificing precision.

Abstract

Reinforcement Learning with Verifiable Feedback (RLVF) has become a key technique for enhancing the reasoning abilities of Large Language Models (LLMs). However, its reliance on sparse, outcome-based rewards, which only indicate if a final answer is correct or not, fails to provide granular guidance on the reasoning process itself. This limitation hinders efficient learning, as the model cannot distinguish between high-quality and inefficient solutions, nor can it learn effectively from different types of failures. To address this, we observe that an LLM's self-certainty often correlates with task difficulty and solution quality. We introduce Difficulty-Aware Certainty-guided Exploration (DACE), a novel RL algorithm that leverages this insight to dynamically balance the exploration-exploitation trade-off. DACE assesses task difficulty online based on the policy's success rate. It then…

Tables

Table 1: Quantitative evaluation of fixed certainty strategies. The table reports the final expected reward for different $\alpha$ values under varying task difficulties (controlled by $\sigma_{r}^{1}$). The highest reward for each difficulty is in bold. The optimal $\alpha$ transitions from positive (exploration) to negative (exploitation) as the task becomes easier (larger $\sigma_{r}^{1}$).

	Task Difficulty (Width of Reward Distribution $\sigma_{r}^{1}$)
$\alpha$	0.6	0.7	0.8	0.9	1.0	1.1	1.2
-0.1	$3.19 \times 10^{-22}$	$1.92 \times 10^{-16}$	0.00177	0.0405	0.289	0.797	0.949
-0.05	$3.57 \times 10^{-22}$	$1.11 \times 10^{-14}$	0.00441	0.0573	0.285	0.611	0.926
0.0	$4.40 \times 10^{-22}$	$3.46 \times 10^{-16}$	0.00411	0.0776	0.233	0.523	0.878
0.05	$5.34 \times 10^{-22}$	0.000911	0.0106	0.0972	0.218	0.404	0.755

Table 2: Performance comparison on mathematical reasoning benchmarks. DACE consistently improves over the GRPO baseline and achieves state-of-the-art results on AIME25 and AMC23. ∗ denotes results reported in the original papers. The highest score for each benchmark is marked in bold.

Datasets	AIME25	AIME24	AMC23	MATH-500
Qwen2.5-7B
Base	2.2	5.2	28.3	54.4
GRPO	15.9	14.6	70.9	82.1
w/ Ent-Adv.∗	11.8	12.6	57.8	58.5
w/ Clip-Cov∗	15.8	22.1	58.2	80.4
w/ KL-Cov∗	12.9	22.6	61.4	80.8
w/ FR3E∗	–	25.2	67.5	79.0
w/ DACE (ours)	17.2 (↑ +1.3)	17.5 (↑ +2.9)	71.4 (↑ +0.5)	81.9 (↓ -0.2)

Table 3: Influence of the difficulty threshold ($\beta_{\text{threshold}}$) and reward weight ($\alpha_{\text{scale}}$) on final accuracy (%).

(a) Accuracy (%) with different difficulty thresholds ($\alpha_{\text{scale}}$ fixed at 0.05).

Threshold ($\beta$)	AIME25	AIME24	AMC23	MATH500	Average
0.0	15.36	14.11	70.76	81.51	45.44
0.2	14.22	13.70	71.23	81.93	45.27
0.4	17.19	17.45	71.37	81.56	46.89
0.6	14.17	15.36	74.49	81.31	46.33
0.8	16.41	20.68	71.42	81.88	47.60
1.0	16.21	15.58	70.31	84.60	46.68

(b) Accuracy (%) with different weights ($\beta_{\text{threshold}}$ fixed at 0.4).

Weight ($\alpha$)	AIME25	AIME24	AMC23	MATH500	Average
0.05	17.19	17.45	71.37	81.56	46.89
0.10	14.69	18.91	72.63	82.44	47.17

Table 4: Hyperparameter configurations for the toy model experiment.

Hyperparameter	Value
Learning Rate ( $η$ )	0.01
PPO Clipping Epsilon ( $ϵ$ )	0.2
Epochs per Update	10
Batch Size	64
Total Training Iterations	34
Steps per Iteration	32

Table 5: Hyperparameter configurations for RL training.

Hyperparameter	Value
Sampling Temperature	0.6
Max Generation Length	8192
Training Epochs	20
Learning Rate	1e-6
Group Size ( $n$ )	16
Global Batch Size	512
PPO Minibatch Size	32

Table 6: Hyperparameter configurations for evaluation sampling.

Hyperparameter	Value
Sampling Temperature	0.6
Max Generation Length	8192
Min-p Sampling	0.95
Top-k Sampling	30

Equations

$$\arg\max\ \mathbb{E}_{\pi}\big[r_{ext}+\beta\, r_{int}\big], \tag{1}$$

$$\max_{\mu,\sigma}\ \mathbb{E}_{a\sim\pi(\cdot|\mu,\sigma)}\big[\mathcal{R}(a)+\alpha\ln\sigma\big], \tag{2}$$

$$\text{diff}(x;\pi)=1-\mathbb{E}_{y\sim\pi(\cdot|x)}\,\mathbb{I}\big[\text{verify}(y)=1\big]\approx 1-\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\big[\text{verify}(y_{i})=1\big]. \tag{3}$$

$$C(y,x;\pi)=-\frac{1}{|y|}\sum_{j=1}^{|y|}\log\pi(y_{j}\mid x,y_{<j}). \tag{4}$$

$$R_{int}(x,y;\pi)=\alpha(x;\pi)\cdot C(y,x;\pi), \tag{5}$$

$$\alpha(x;\pi)=\alpha_{\text{scale}}\cdot\operatorname{sgn}\big(\beta_{\text{threshold}}-\text{diff}(x;\pi)\big). \tag{6}$$

$$\max_{\pi}\ \mathbb{E}_{x\sim D,\,y\sim\pi(\cdot\mid x)}\big[R_{ext}(x,y)+\alpha(x;\pi)\,C(y,x;\pi)\big], \tag{7}$$

Full text

[1] PKU [2] ByteDance Seed

Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning

Ang Li †

Zhihang Yuan ‡

Yang Zhang

Shouda Liu

Yisen Wang

(August 29, 2025)

Abstract

Reinforcement Learning with Verifiable Feedback (RLVF) has become a key technique for enhancing the reasoning abilities of Large Language Models (LLMs). However, its reliance on sparse, outcome-based rewards—which only indicate if a final answer is correct or not—fails to provide granular guidance on the reasoning process itself. This limitation hinders efficient learning, as the model cannot distinguish between high-quality and inefficient solutions, nor can it learn effectively from different types of failures. To address this, we observe that an LLM’s self-certainty often correlates with task difficulty and solution quality. We introduce Difficulty-Aware Certainty-guided Exploration (DACE), a novel RL algorithm that leverages this insight to dynamically balance the exploration-exploitation trade-off. DACE assesses task difficulty online based on the policy’s success rate. It then uses this signal to modulate an intrinsic reward: for difficult tasks where the model is struggling, DACE encourages exploration by penalizing high certainty; for easier tasks, it encourages learning efficiency by rewarding high certainty. Experiments on challenging mathematical reasoning benchmarks (AIME, MATH) show that DACE significantly outperforms strong baselines. The DACE-trained models not only achieve higher accuracy but also demonstrate more robust performance when scaling test-time compute, validating that our adaptive approach fosters effective exploration without sacrificing precision.

† Work done during internship at ByteDance. ‡ Project lead.

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning domains such as mathematics and programming [1, 2]. A key driver of this success has been Reinforcement Learning with Verifiable Feedback (RLVF), a paradigm that fine-tunes models using rewards derived from rule-based verifiers [3]. These verifiers offer a scalable and objective method for assigning binary rewards (e.g., 0 for incorrect, 1 for correct). However, this reliance on sparse, outcome-based rewards introduces a significant limitation: the inability to provide granular feedback on the reasoning process itself. A verifier cannot distinguish between two correct solutions of varying quality—one being concise and elegant, the other verbose and inefficient. Similarly, it treats all incorrect solutions as equally uninformative, failing to guide the model away from specific reasoning fallacies.

This issue is illustrated in Figure 1, which showcases responses from Qwen2.5-7B [4] to problems from the American Mathematics Competitions benchmark. When generating incorrect solutions (left), the model may exhibit qualitatively different failures; some represent promising but flawed reasoning paths worth exploring, while others are simple dead ends. When generating correct solutions (right), some are notably more direct and efficient. From a policy learning perspective, an ideal reward signal would encourage exploration away from common failure modes while simultaneously steering the policy to exploit and refine efficient, correct solutions. This level of guidance is unattainable with simple binary rewards.

While prior work has explored process-based rewards [5, 6, 7], these methods can be costly to scale. We turn instead to intrinsic signals generated by the model itself. Observing Figure 1 again, we note that the model’s self-certainty—its confidence in its own generation [8]—correlates with these qualitative differences. For challenging problems, low-certainty responses may signal valuable exploration. For simpler problems, high-certainty responses often reflect efficient, reliable reasoning. This observation leads to our central research question:

Can we leverage an intrinsic signal like self-certainty to dynamically guide the model’s exploration-exploitation trade-off based on its real-time assessment of task difficulty?

In this paper, we answer this question affirmatively. We first use a toy RL environment to demonstrate that the optimal strategy is contingent on task difficulty: exploration is vital for hard tasks, while exploitation accelerates convergence on easy ones. Building on this insight, we introduce Difficulty-Aware Certainty-guided Exploration (DACE), a novel algorithm that dynamically adjusts its learning objective. For problems the model finds difficult (i.e., has a low success rate), DACE encourages lower certainty to promote exploration. Conversely, for problems the model has mastered, it encourages higher certainty to exploit and refine its knowledge.

Our extensive experiments on large-scale mathematical reasoning benchmarks validate the effectiveness of DACE. Key findings include:

• DACE consistently outperforms a strong GRPO baseline, delivering significant performance gains on challenging, competition-level datasets like AIME25 (+1.3) and AIME24 (+2.9).

• Ablation studies confirm that DACE’s adaptive strategy is superior to fixed strategies of pure exploration or pure exploitation, which are shown to be suboptimal.

• The performance advantage of DACE-trained models widens when scaling test-time compute, demonstrating that our method fosters a more robust and diverse set of correct reasoning paths without sacrificing precision.

Our work demonstrates that harnessing intrinsic signals to intelligently navigate the exploration-exploitation dilemma is a powerful and easy-to-integrate approach for LLM training in sparse-reward settings. We believe DACE represents a notable step toward more effective reinforcement learning for complex reasoning.

2 Related Works

Our work, DACE, is situated at the intersection of two major research areas: the foundational exploration-exploitation dilemma in Reinforcement Learning (RL) and the emerging use of intrinsic signals for guiding LLM reasoning. We first review the classical RL context before examining how these principles have been adapted, and often simplified, for LLMs, thereby motivating our approach for a more adaptive strategy.

The Exploration-Exploitation Dilemma in Reinforcement Learning. The need to balance gathering new information (exploration) with leveraging known rewards (exploitation) is a fundamental challenge in RL, particularly in domains with sparse rewards [9]. In the context of LLM reasoning, where a reward may only be granted for a complete and correct final answer, this challenge is especially acute. Theoretical approaches have established near-optimal regret bounds in structured settings like bandits and MDPs [10, 11, 12, 13]. In practical deep RL, this trade-off is managed through various techniques, including lookahead planning [14, 15], policy randomization [16, 17, 18, 19], and—most relevant to our work—the use of intrinsic rewards. The core idea of intrinsic motivation is to augment the sparse external reward ( $r_{ext}$ ) with a dense, internally-generated signal ( $r_{int}$ ), optimizing for a combined objective:

$$\arg\max\ \mathbb{E}_{\pi}\big[r_{ext}+\beta\, r_{int}\big], \tag{1}$$

where $\beta$ controls the influence of the intrinsic reward. Researchers have proposed many forms of $r_{int}$ to encourage exploration, such as novelty-based signals derived from state counts [18] or prediction errors (curiosity) [20, 21, 22, 23], information gain [24], or skill diversity [25]. A key limitation of many of these methods is that the intrinsic signal is applied uniformly, often with a fixed goal (e.g., always seek novelty), without adapting to the agent’s growing competence or the specific difficulty of the current task.

Intrinsic Signals for LLM Reasoning. When applying these concepts to LLM reasoning, the research has diverged into two distinct streams, creating a dichotomy between purely exploitative and purely explorative objectives. On one hand, a significant body of work uses intrinsic signals to enforce exploitation. These methods often operate without any external reward ( $r_{ext}$ ) and aim to refine the LLM’s existing knowledge by maximizing signals like self-consistency [26], self-certainty [27], or confidence (i.e., negative entropy) [28, 29, 30]. While successful, particularly for test-time adaptation, these approaches are inherently conservative and may struggle with problems that require novel lines of reasoning beyond the model’s initial high-confidence paths. On the other hand, a second stream of work uses intrinsic signals to promote exploration. Here, the primary focus has been on maximizing policy entropy, either to understand its role in reasoning [31, 30] or to explicitly encourage diverse thought processes during supervised training [32, 33]. Other intrinsic signals include the probability ratio between the current and a reference model, used to guide the generation process [34, 35]. These methods are effective at discovering new strategies but risk over-exploring or failing to commit to a promising solution.

Our work, DACE, bridges this gap. We argue that the choice to explore or exploit should not be static. Instead, it should be a dynamic decision guided by the model’s awareness of the task’s difficulty. We propose using self-certainty not as a fixed goal to be maximized, but as an adaptive intrinsic signal. When certainty is high on an easy problem, DACE encourages exploitation. When certainty is low on a difficult problem, it encourages exploration. In this way, DACE leverages a well-established intrinsic signal [8] to create a nuanced, difficulty-aware balance between exploration and exploitation for LLM reasoning.

3 Motivating DACE: The Need for Adaptive Exploration

We begin by presenting an illustrative case study to build the intuition for our proposed method, Difficulty-Aware Certainty-guided Exploration (DACE). This study demonstrates a core principle: the optimal balance between exploration and exploitation is not static but depends critically on task difficulty. The findings from this simplified setting reveal the limitations of fixed strategies and provide a clear rationale for an adaptive approach like DACE.

3.1 A Toy Learning Setting

We construct a simple continuous reinforcement learning environment where an agent’s policy is a one-dimensional Gaussian distribution, $\pi(a;\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(a-\mu)^{2}}{2\sigma^{2}})$ . Here, the mean $\mu$ represents the agent’s belief about the optimal action, while the standard deviation $\sigma$ reflects its certainty. A small $\sigma$ indicates high certainty (exploitation), whereas a large $\sigma$ indicates low certainty (exploration). The reward function is a mixture of two Gaussian-shaped peaks, ${\mathcal{R}}(a)={\mathcal{R}}(a;-\mu_{r},\sigma_{r})+{\mathcal{R}}(a;\mu_{r},1.0)$ , where ${\mathcal{R}}(a;\mu_{r},\sigma_{r})\propto\exp(-\frac{(a-\mu_{r})^{2}}{2\sigma_{r}^{2}})$ . By adjusting $\mu_{r}$ and $\sigma_{r}$ , we can simulate environments with varying reward landscapes, from sparse (difficult) to dense (easy).

3.2 Certainty as a Static Learning Objective

We quantify policy certainty using its log standard deviation, $\ln\sigma$ , and incorporate it directly into the learning objective:

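$$\max_{\mu,\sigma}\ \mathbb{E}_{a\sim\pi(\cdot|\mu,\sigma)}\big[\mathcal{R}(a)+\alpha\ln\sigma\big], \tag{2}$$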

where the hyperparameter $\alpha$ dictates a fixed learning strategy. We compare three distinct, static configurations:

• Forced Exploration ( $\alpha>0$ ): The policy is always encouraged to increase its variance (lower its certainty), analogous to Maximum Entropy RL.

• Standard RL ( $\alpha=0$ ): The policy only maximizes the expected reward, with no explicit control over certainty.

• Forced Exploitation ( $\alpha<0$ ): The policy is always penalized for high variance, encouraging it to shrink its distribution and exploit known rewards.

We use Proximal Policy Optimization (PPO) to optimize the policy under these fixed strategies. Further experimental details are in Appendix 7.1.
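As a hedged illustration of the certainty-regularized objective in this section, the sketch below evaluates $\mathbb{E}[\mathcal{R}(a)]+\alpha\ln\sigma$ in closed form for the Gaussian policy and two-peak reward of Section 3.1. It relies on the standard Gaussian integral $\mathbb{E}_{a\sim\mathcal{N}(\mu,\sigma^{2})}[\exp(-\frac{(a-m)^{2}}{2s^{2}})]=\frac{s}{\sqrt{s^{2}+\sigma^{2}}}\exp(-\frac{(\mu-m)^{2}}{2(s^{2}+\sigma^{2})})$. The function names and the values $\mu_{r}=2.0$ and $\sigma_{r}=0.6$ are illustrative assumptions rather than settings from the paper, and the paper optimizes with PPO instead of evaluating this expression directly.

```python
import math


def expected_bump(mu, sigma, center, width):
    """E over a ~ N(mu, sigma^2) of exp(-(a - center)^2 / (2 * width^2))."""
    var = width ** 2 + sigma ** 2
    return width / math.sqrt(var) * math.exp(-((mu - center) ** 2) / (2.0 * var))


def toy_objective(mu, sigma, alpha, mu_r=2.0, sigma_r=0.6):
    """Expected two-peak reward plus the alpha * ln(sigma) term.

    mu_r and sigma_r are illustrative defaults (assumptions), not paper values.
    """
    expected_reward = (
        expected_bump(mu, sigma, -mu_r, sigma_r)  # peak at -mu_r with width sigma_r
        + expected_bump(mu, sigma, mu_r, 1.0)     # peak at +mu_r with width 1.0
    )
    return expected_reward + alpha * math.log(sigma)


if __name__ == "__main__":
    # Compare the three static strategies at the same policy parameters.
    for alpha in (-0.05, 0.0, 0.05):
        print(alpha, toy_objective(mu=0.0, sigma=2.0, alpha=alpha))
```

With $\sigma>1$ the term $\alpha\ln\sigma$ is positive for $\alpha>0$ and negative for $\alpha<0$, so the same policy scores higher under forced exploration and lower under forced exploitation.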

3.3 Observation: Optimal Strategy Depends on Difficulty

Our experiments reveal a clear dependency between the optimal learning strategy and the difficulty of the task, which we analyze both qualitatively and quantitatively.

Qualitative Analysis. Figure 2 illustrates two contrasting scenarios. In a difficult setting with sparse rewards (Figure 2(a)), a fixed exploration strategy ( $\alpha>0$ ) is superior. By lowering its certainty, the policy explores a wider action space and successfully discovers the distant reward mode. Conversely, in an easier setting with dense rewards (Figure 2(b)), a fixed exploitation strategy ( $\alpha<0$ ) converges much faster. By increasing its certainty, the policy quickly homes in on the obvious optimal action.

Quantitative Analysis. We confirm this finding quantitatively in Table 1. We vary the task difficulty by adjusting the width of one reward peak, $\sigma_{r}^{1}$ , from narrow (harder) to wide (easier). The results show a clear transition point. For difficult tasks with sparse rewards ( $\sigma_{r}^{1}\leq 0.9$ ), an explorative strategy ( $\alpha>0$ ) achieves the highest final reward. As the task becomes easier with denser rewards ( $\sigma_{r}^{1}\geq 1.0$ ), an exploitative strategy ( $\alpha<0$ ) becomes dominant.

Taken together, these findings highlight a critical limitation in static learning strategies: no single choice of $\alpha$ is optimal across all scenarios. This motivates the central idea behind DACE: an agent should not be forced to always explore or always exploit, but should instead learn to adapt its strategy based on its own assessment of the task’s difficulty.

4 DACE: Difficulty-Aware Certainty-guided Exploration

Building on the insight from our case study, we introduce Difficulty-Aware Certainty-guided Exploration (DACE). DACE is an RL algorithm designed to dynamically balance the exploration-exploitation trade-off. It achieves this by operationalizing the key insight from our motivating example: the agent must first assess the difficulty of a given task and then use that assessment to modulate its own policy certainty, deciding whether to explore or exploit. DACE is composed of three core components: difficulty estimation, a certainty metric, and a mechanism to connect them.

Difficulty-Awareness: The first component of DACE is a mechanism to quantify task difficulty. Crucially, difficulty is not an absolute property of a task but is relative to the policy’s current capabilities. We therefore define an online difficulty measure for a query $x$ with respect to the current policy $\pi$ as its estimated failure rate. This is a practical, policy-relative proxy for difficulty, which we approximate by sampling $n$ responses and averaging their outcomes from a binary verifier:

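$$\text{diff}(x;\pi)=1-\mathbb{E}_{y\sim\pi(\cdot|x)}\,\mathbb{I}\big[\text{verify}(y)=1\big]\approx 1-\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\big[\text{verify}(y_{i})=1\big]. \tag{3}$$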

A higher $\text{diff}(x;\pi)$ value indicates a problem that the current policy finds more challenging.
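The Monte Carlo estimate of difficulty described above amounts to one minus the empirical success rate over the $n$ sampled responses. A minimal sketch, assuming the verifier outcomes are already available as a list of booleans (the function name `estimate_difficulty` is hypothetical):

```python
def estimate_difficulty(verified):
    """Estimated failure rate: 1 - (fraction of sampled responses verified correct)."""
    if not verified:
        raise ValueError("need at least one sampled response")
    return 1.0 - sum(1 for v in verified if v) / len(verified)


if __name__ == "__main__":
    # 16 sampled responses, 4 of them verified correct.
    outcomes = [True] * 4 + [False] * 12
    print(estimate_difficulty(outcomes))
```

For example, a group of 16 samples with 4 correct answers gives a difficulty of 0.75.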

Certainty as a Lever for Exploration. The second component is a way to measure and control the policy’s behavior. Generalizing from the standard deviation ( $\sigma$ ) in our toy example, we define policy certainty for autoregressive LLMs based on prior work [8]. The certainty of a generated sequence $y$ is its negative average log-probability:

$$C(y,x;\pi)=-\frac{1}{|y|}\sum_{j=1}^{|y|}\log\pi(y_{j}\mid x,y_{<j}). \tag{4}$$

This metric serves as a lever to control behavior. Maximizing certainty encourages the policy to use high-probability tokens, leading to deterministic, exploitative behavior. Conversely, minimizing certainty allows for lower-probability tokens, promoting diverse and explorative behavior.
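The sequence-level quantity defined above can be computed from per-token log-probabilities in a few lines. The sketch below implements the formula literally as stated (the negative mean token log-probability), keeping the sign convention of the definition. The function name and the assumption that `token_logprobs` holds $\log\pi(y_{j}\mid x,y_{<j})$ for each generated token are illustrative and not taken from the paper's code.

```python
import math


def sequence_certainty(token_logprobs):
    """C(y, x; pi) as written: negative mean of the per-token log-probabilities."""
    if not token_logprobs:
        raise ValueError("empty response")
    return -sum(token_logprobs) / len(token_logprobs)


if __name__ == "__main__":
    # Hypothetical response of three tokens, each generated with probability 0.5.
    logprobs = [math.log(0.5)] * 3
    print(sequence_certainty(logprobs))
```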

Connecting Difficulty to Certainty via an Adaptive Intrinsic Reward. The core of DACE is the intrinsic reward that forges a dynamic link between policy-relative difficulty (Eq. 3) and behavioral certainty (Eq. 4). We achieve this with an adaptive coefficient, $\alpha(x;\pi)$ , that modulates the certainty-based reward:

$$R_{int}(x,y;\pi)=\alpha(x;\pi)\cdot C(y,x;\pi), \tag{5}$$

where the coefficient’s sign is determined by comparing the task difficulty to a threshold:

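$$\alpha(x;\pi)=\alpha_{\text{scale}}\cdot\operatorname{sgn}\big(\beta_{\text{threshold}}-\text{diff}(x;\pi)\big). \tag{6}$$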

Here, $\alpha_{\text{scale}}>0$ is a scaling factor, and $\beta_{\text{threshold}}$ is a difficulty threshold. This formulation creates the desired adaptive behavior:

• For hard tasks ( $\text{diff}(x;\pi)>\beta_{\text{threshold}}$ ): The coefficient $\alpha(x;\pi)$ becomes negative. The objective becomes maximizing $-\alpha_{\text{scale}}\cdot C(y,x;\pi)$ , which is equivalent to minimizing policy certainty. DACE thus encourages the agent to explore when it is struggling.

• For easy tasks ( $\text{diff}(x;\pi)<\beta_{\text{threshold}}$ ): The coefficient $\alpha(x;\pi)$ is positive. The objective encourages maximizing certainty. DACE thus pushes the policy to exploit and refine its successful strategies on problems it can already solve.

The Full DACE Objective. The complete DACE learning objective integrates this adaptive intrinsic reward with the standard external reward from the environment:

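$$\max_{\pi}\ \mathbb{E}_{x\sim D,\,y\sim\pi(\cdot\mid x)}\big[R_{ext}(x,y)+\alpha(x;\pi)\,C(y,x;\pi)\big], \tag{7}$$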

where $R_{ext}$ is the external reward function (e.g., accuracy) that we want to maximize. This objective dynamically adjusts the learning pressure based on the agent’s real-time performance on a given task. We optimize this objective using Group Relative Policy Optimization (GRPO). This choice is highly synergistic, as the $n$ samples required by GRPO for its policy update can be directly reused for the difficulty estimation in Equation 3. This makes the implementation of DACE both elegant and computationally efficient.
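To make the sample reuse concrete, the following sketch computes the per-response reward $R_{ext}+\alpha(x;\pi)\,C(y,x;\pi)$ for one sampled group, using the same verifier outcomes both as the external reward and for the difficulty estimate. It assumes a binary external reward (1 for correct, 0 for incorrect), precomputed certainty values, and $\operatorname{sgn}(0)=0$; the function name is hypothetical, and the policy-gradient step itself (advantage computation and clipping) is deliberately omitted.

```python
def dace_group_rewards(verified, certainties, alpha_scale=0.05, beta_threshold=0.4):
    """Return R_ext + alpha(x; pi) * C(y, x; pi) for every response in one group."""
    if not verified or len(verified) != len(certainties):
        raise ValueError("verified and certainties must be non-empty and equal length")
    n = len(verified)
    # The group's own samples give the difficulty estimate (failure rate).
    difficulty = 1.0 - sum(1 for v in verified if v) / n
    gap = beta_threshold - difficulty
    alpha = alpha_scale * ((gap > 0) - (gap < 0))  # assumption: sgn(0) = 0
    # Binary external reward (assumption) plus the adaptive intrinsic term.
    return [float(bool(v)) + alpha * c for v, c in zip(verified, certainties)]


if __name__ == "__main__":
    # Hypothetical group of four responses, one of which is correct.
    rewards = dace_group_rewards(
        verified=[True, False, False, False],
        certainties=[1.0, 2.0, 3.0, 4.0],
    )
    print(rewards)
```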

5 Experiments

We now empirically evaluate Difficulty-Aware Certainty-guided Exploration (DACE) to validate our central hypothesis: that dynamically balancing exploration and exploitation based on task difficulty enhances the mathematical reasoning capabilities of LLMs.

Setup. We use Qwen2.5-7B [4] as our base model and conduct RL training within the VeRL framework [36]. Our training data is a de-duplicated mixture of the DAPO [37] and MATH [38] datasets. For all runs, we adopt the clip-higher trick from DAPO with $\epsilon_{low}=0.2$ and $\epsilon_{high}=0.28$ . For DACE, our default configuration uses a scaling factor of $\alpha_{\text{scale}}=0.05$ and a difficulty threshold of $\beta_{\text{threshold}}=0.4$ . We set the group size for our policy optimizer to $16$ and omit standard entropy/KL regularization terms. A complete list of hyper-parameters is provided in Appendix 7.2.

Evaluation. We follow standard protocols for assessing mathematical reasoning, reporting mean@32 accuracy on four widely-used benchmarks: AIME25, AIME24, AMC23, and MATH-500 [38]. For evaluation, we use a temperature of $0.6$ and top-k of $30$ .
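The paper does not spell out how mean@32 is computed. A common reading, assumed here, is that each problem's accuracy is the fraction of its $k$ sampled responses that are correct, and the benchmark score is the average of these per-problem accuracies, reported as a percentage. The function name `mean_at_k` is hypothetical.

```python
def mean_at_k(correct_per_problem):
    """Average, over problems, of the fraction of correct samples (in percent)."""
    if not correct_per_problem:
        raise ValueError("need at least one problem")
    per_problem = [
        sum(1 for c in samples if c) / len(samples)
        for samples in correct_per_problem
    ]
    return 100.0 * sum(per_problem) / len(per_problem)


if __name__ == "__main__":
    # Two hypothetical problems with k = 2 samples each.
    print(mean_at_k([[True, False], [True, True]]))
```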

Baselines. We compare DACE against a strong GRPO baseline [3] with the clip-higher trick, as well as several advanced methods. These include Ent-Adv [33], which promotes exploration by encouraging longer reasoning; Clip-Cov and KL-Cov [39], which control updates based on token-entropy covariance; and FR3E [32], which performs targeted rollouts on high-entropy tokens. Due to the high cost of reproduction, we report the results for these baselines directly from their original papers.

5.1 Main Results

The empirical results, summarized in Table 2, demonstrate the effectiveness of DACE’s adaptive strategy. Our method consistently outperforms the strong GRPO baseline on the most challenging benchmarks.

Specifically, DACE achieves absolute gains of +1.3 points on AIME25, +2.9 on AIME24, and +0.5 on AMC23 over GRPO. These improvements establish DACE as the top-performing method on both AIME25 and AMC23 among all listed approaches. While FR3E leads on AIME24, DACE still shows a significant improvement over GRPO and other techniques. On MATH-500, DACE’s performance is on par with the strong GRPO baseline, showing only a marginal 0.2-point difference. These results suggest that DACE’s dynamic approach—exploring on hard problems while exploiting on easy ones—is particularly effective for the most complex, competition-level reasoning tasks.

Scaling Test-Time Compute. We next investigate how models trained with DACE perform when more computational resources are allocated at test time. As shown in Figure 3, the DACE-trained model maintains a consistent and significant performance advantage over the GRPO baseline on AIME25. Notably, the performance gap widens as we increase the number of samples, growing from a +1.2 point lead at mean@16 to a +3.3 point lead at mean@128. This outcome validates the core hypothesis of DACE. By encouraging exploitation (higher certainty) on easier problems, the model maintains high precision for low sample counts (pass@1). Simultaneously, by encouraging exploration (lower certainty) on harder problems, it discovers a wider range of correct solutions, boosting performance at high sample counts (pass@k). This adaptive strategy effectively fosters robust exploration without sacrificing precision.

Training Dynamics. An analysis of the training process reveals the mechanism behind DACE’s performance gains. Figure 4 shows that DACE’s adaptive reward induces a distinctly more exploratory training behavior compared to GRPO. This is evidenced by consistently lower model self-certainty, higher token-level entropy, and slightly longer responses. Interestingly, the divergence between the two methods is most prominent during the intermediate stages of training before the metrics begin to converge. This suggests that DACE injects a critical phase of exploration mid-training, allowing the model to discover more diverse and robust reasoning strategies that lead to its final performance advantage.

Summary of Main Results. Our experiments provide comprehensive evidence for DACE’s effectiveness. First, on standard benchmarks, DACE delivers substantial performance gains over a strong GRPO baseline, especially on difficult math problems. Second, this advantage is amplified when scaling test-time compute, demonstrating the robustness of the learned policy. Finally, training dynamics reveal that DACE successfully encourages a more exploratory behavior during critical training phases. This allows the model to discover better reasoning paths without sacrificing precision, confirming that DACE’s principle of difficulty-aware certainty guidance is a potent method for enhancing complex reasoning.

5.2 Understanding DACE: The Role of the Difficulty Threshold

We now delve deeper into DACE’s core mechanism by analyzing the trade-off between exploration and exploitation as we vary the difficulty threshold, $\beta_{\text{threshold}}$ .

Setup. Using our default configuration, we perform a grid search over $\beta_{\text{threshold}}\in\{0,0.2,0.4,0.6,0.8,1.0\}$ . The endpoints represent fixed, non-adaptive strategies. A threshold of $\beta_{\text{threshold}}=0.0$ forces DACE to always treat problems as ‘easy’, thus always rewarding high certainty (pure exploitation). Conversely, $\beta_{\text{threshold}}=1.0$ forces DACE to always treat problems as ‘hard’, thus always penalizing high certainty and promoting maximum entropy (pure exploration).

We first examine how $\beta_{\text{threshold}}$ influences training dynamics, shown in Figure 5. As the threshold increases, we observe a clear trend of decreasing self-certainty and increasing response length. This is expected: a higher threshold means more problems are classified as ‘hard’, triggering the exploratory, certainty-penalizing reward. Notably, the pure exploration strategy ( $\beta_{\text{threshold}}=1.0$ ) is an outlier, generating responses 3x longer than other settings. This highlights a known pitfall of constant exploration: it can lead to inefficient, redundant reasoning and poor convergence [33]. In contrast, the pure exploitation strategy ( $\beta_{\text{threshold}}=0.0$ ) produces the shortest responses, but as we will see, this comes at the cost of accuracy.

Next, we evaluate the impact of $\beta_{\text{threshold}}$ on final accuracy in Table 3(a). The results clearly show that fixed strategies are suboptimal. Pure exploitation ( $\beta=0.0,0.2$ ) yields low average accuracy, suggesting the model gets stuck in local optima and fails to discover better reasoning paths. Pure exploration ( $\beta=1.0$ ) is also suboptimal, suffering from inconsistent performance and high computational cost (as seen in response length). The best performance is achieved with intermediate thresholds of $0.4$ and $0.8$ , with $\beta_{\text{threshold}}=0.8$ achieving the highest average accuracy of 47.60%. This confirms that the power of DACE lies not in forcing one behavior, but in its ability to dynamically switch between them based on task difficulty.
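The averages quoted from Table 3(a) can be reproduced with simple arithmetic. The sketch below recomputes each row's mean over the four benchmarks from the per-benchmark accuracies in the table and picks the threshold with the highest reported average. The data are copied from Table 3(a); the variable and function names are hypothetical.

```python
# Rows of Table 3(a): threshold -> (AIME25, AIME24, AMC23, MATH500, reported average)
TABLE_3A = {
    0.0: (15.36, 14.11, 70.76, 81.51, 45.44),
    0.2: (14.22, 13.70, 71.23, 81.93, 45.27),
    0.4: (17.19, 17.45, 71.37, 81.56, 46.89),
    0.6: (14.17, 15.36, 74.49, 81.31, 46.33),
    0.8: (16.41, 20.68, 71.42, 81.88, 47.60),
    1.0: (16.21, 15.58, 70.31, 84.60, 46.68),
}


def recomputed_averages(table):
    """Mean of the four benchmark accuracies for each threshold."""
    return {beta: sum(row[:4]) / 4 for beta, row in table.items()}


def best_threshold(table):
    """Threshold with the highest reported average accuracy."""
    return max(table, key=lambda beta: table[beta][4])


if __name__ == "__main__":
    for beta, avg in recomputed_averages(TABLE_3A).items():
        print(f"beta={beta:.1f} recomputed={avg:.4f} reported={TABLE_3A[beta][4]:.2f}")
    print("best threshold by reported average:", best_threshold(TABLE_3A))
```

The recomputed means agree with the reported averages up to rounding to two decimals.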

Finally, we briefly investigate the influence of the reward scaling factor, $\alpha_{\text{scale}}$ . As shown in Table 3(b), when fixing the threshold at our default of 0.4, increasing the weight from 0.05 to 0.10 further improves the average accuracy to 47.17%. This suggests that tuning the strength of DACE’s guidance signal is another promising avenue for optimization.

6 Conclusion

In this work, we addressed the challenge of sparse rewards in Reinforcement Learning with Verifiable Feedback (RLVF) for LLM reasoning. We proposed that an LLM’s intrinsic self-certainty, when guided by task difficulty, can provide a powerful, granular training signal. We introduced Difficulty-Aware Certainty-guided Exploration (DACE), a novel algorithm that dynamically balances the exploration-exploitation trade-off. By assessing its own success rate on a problem, DACE adaptively encourages exploration (lower certainty) on difficult tasks and exploitation (higher certainty) on easier ones. Our experiments on challenging mathematical reasoning benchmarks demonstrated that DACE significantly outperforms strong baselines, and its performance advantage widens as test-time compute is scaled, confirming that it learns a more robust set of solutions.

While DACE proves effective, its reliance on a fixed difficulty threshold presents an opportunity for future work, such as adjusting this parameter through an explicit curriculum [40]. Further research could also explore more sophisticated proxies for difficulty and certainty to refine the guidance signal. In addition, intrinsic signals are prone to over-optimization; we discuss how we mitigate this in Appendix 7.4 and regard it as another important topic for future research. Nevertheless, our work validates a key principle: harnessing a model’s internal state to intelligently guide its learning strategy is a potent and sample-efficient approach. DACE represents a principled step toward developing agents capable of mastering complex cognitive tasks.

exploration: A study of count-based exploration for deep reinforcement learning.

In NeurIPS, 2017.

[19]

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy.

Deep exploration via bootstrapped dqn.

In NeurIPS, 2016.

[20]

Jürgen Schmidhuber.

Curious model-building control systems.

In IJCNN, 1991.

[21]

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell.

Curiosity-driven exploration by self-supervised prediction.

In ICML, 2017.

[22]

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov.

Exploration by random network distillation.

arXiv preprint arXiv:1810.12894, 2018.

[23]

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak.

Planning to explore via self-supervised world models.

In ICML, 2020.

[24]

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel.

Vime: Variational information maximizing exploration.

In NeurIPS, 2016.

[25]

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine.

Diversity is all you need: Learning skills without a reward function.

In ICLR, 2019.

[26]

Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, and Bowen Zhou.

Ttrl: Test-time reinforcement learning.

[27]

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song.

Learning to reason without external rewards.

arXiv preprint arXiv:2505.19590, 2025.

[28]

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian.

Right question is already half the answer: Fully unsupervised llm reasoning incentivization.

arXiv preprint arXiv:2504.05812, 2025.

[29]

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng.

The unreasonable effectiveness of entropy minimization in llm reasoning.

arXiv preprint arXiv:2505.15134, 2025.

[30]

Zitian Gao, Lynx Chen, Joey Zhou, and Bryan Dai.

One-shot entropy minimization.

arXiv preprint arXiv:2505.20282, 2025.

[31]

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin.

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.

arXiv preprint arXiv:2506.01939, 2025.

[32]

Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, Qian Liu, Ge Zhang, and Zejun Ma.

First return, entropy-eliciting explore.

arXiv preprint arXiv:2507.07017, 2025.

[33]

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei.

Reasoning with exploration: An entropy perspective.

arXiv preprint arXiv:2506.14758, 2025.

[34]

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al.

Process reinforcement through implicit rewards.

arXiv preprint arXiv:2502.01456, 2025.

[35]

Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng.

Free process rewards without process labels.

arXiv preprint arXiv:2412.01981, 2024.

[36]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu.

Hybridflow: A flexible and efficient rlhf framework.

arXiv preprint arXiv:2409.19256, 2024.

[37]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, and Mingxuan Wang.

Dapo: An open-source llm reinforcement learning system at scale.

arXiv preprint arXiv:2503.14476, 2025.

[38]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt.

Measuring mathematical problem solving with the math dataset.

arXiv preprint arXiv:2103.03874, 2021.

[39]

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding.

The entropy mechanism of reinforcement learning for reasoning language models.

arXiv preprint arXiv:2505.22617, 2025.

[40]

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston.

Curriculum learning.

In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.

[41]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica.

Efficient memory management for large language model serving with pagedattention.

In SOSP, 2023.

[42]

Hynek Kydlíček.

Math-Verify: Math Verification Library.

https://github.com/huggingface/math-verify, 2025.

Version 0.6.1, License: Apache-2.0, Keywords: verification, math, evaluation.


7 Additional Experimental Details

7.1 Details of the Toy Model Experiment

The motivating experiment presented in Section 3 was conducted using a custom PPO implementation. The key hyperparameters used for this experiment are detailed in Table 4. The standard entropy bonus was disabled, as the exploration-exploitation balance was explicitly controlled by the $\alpha$ coefficient in our objective function (Equation 2).

7.2 Details of RL Training for LLMs

All of our large-scale reinforcement learning experiments were conducted using the VeRL framework [36]. We employed a learning rate schedule with a linear warm-up phase over the first 25 steps, followed by a cosine decay to zero for the remainder of training. The key hyperparameters for our RL training runs are provided in Table 5.
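The schedule described above can be sketched as a standalone function. This is an illustrative sketch rather than the VeRL configuration: the function name, the `peak_lr` argument, and the convention that the rate reaches zero exactly at `step == total_steps` are assumptions.

```python
import math


def learning_rate(step: int, total_steps: int, peak_lr: float,
                  warmup_steps: int = 25) -> float:
    """Linear warm-up for `warmup_steps` steps, then cosine decay to zero.

    `step` is zero-based. `peak_lr` is a placeholder; the actual value is
    listed in the paper's hyperparameter table and is not reproduced here.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")

    if step < warmup_steps:
        # Ramp linearly so that the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with `total_steps=125` the rate peaks at step 24, is half of `peak_lr` at step 75, and reaches zero at step 125.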

7.3 Details of Evaluation

For all evaluations, we used the vLLM inference engine [41] to generate responses from the trained models. The complete set of sampling parameters is listed in Table 6. To verify correctness, we first extracted the final answer enclosed within the \boxed{…} command using regular expressions. We then used the Math-Verify library [42] to programmatically check the correctness of the extracted answer against the ground truth.
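A minimal sketch of the answer-extraction step follows. It is an assumption about how such extraction might be done, not the authors' script: a regular expression locates the last boxed command, and a brace counter is added so that nested braces (for example in fractions) are handled. The subsequent Math-Verify check is not reproduced here.

```python
import re
from typing import Optional

# Matches the opening of a boxed command; the body is read by brace counting.
_BOXED_OPEN = re.compile(r"\\boxed\s*\{")


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed command in `text`, or None."""
    matches = list(_BOXED_OPEN.finditer(text))
    if not matches:
        return None

    start = matches[-1].end()   # index just after the opening brace
    depth = 1
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i].strip()
    return None                 # unbalanced braces: treat as no answer
```

Taking the last occurrence is a design choice made here on the assumption that a response states its final answer at the end; responses without a well-formed boxed answer yield `None`.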

7.4 Mitigating Reward Hacking

During our initial experiments, we observed that the intrinsic certainty signal was prone to reward hacking. To ensure training stability and prevent the model from exploiting the reward function, we implemented two key mitigation strategies.

Certainty Normalization. To prevent the model from simply outputting extreme log-probabilities, we normalized the raw certainty values within each group of $n$ responses. Specifically, we applied a group-wise z-score normalization followed by a min-max scaling to bound the final certainty values within the range $[0,1]$.
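The normalization just described can be sketched as follows. The function name, the `eps` stabilizer, and the fallback value of 0.5 for groups with identical certainty are assumptions introduced for illustration.

```python
import statistics
from typing import List


def normalize_certainty(raw: List[float], eps: float = 1e-8) -> List[float]:
    """Group-wise z-score followed by min-max scaling into [0, 1].

    `raw` holds the raw certainty values of the n responses to one prompt.
    """
    if not raw:
        return []

    # Step 1: z-score within the group (population standard deviation).
    mean = statistics.fmean(raw)
    std = statistics.pstdev(raw)
    z = [(x - mean) / (std + eps) for x in raw]

    # Step 2: min-max scaling of the z-scores.
    lo, hi = min(z), max(z)
    if hi - lo < eps:
        # Assumption: a group with no spread carries no ranking signal,
        # so every response receives the neutral midpoint.
        return [0.5] * len(raw)
    return [(v - lo) / (hi - lo) for v in z]
```

Because min-max scaling is unchanged by any positive affine rescaling of its input, the z-score step does not alter the final values in exact arithmetic when the group has non-zero spread; it is kept here only to mirror the two-step description.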

Penalizing Code Execution. We identified a specific failure mode where the model would generate solutions containing Python code snippets, effectively using an implicit, unverified computational tool to arrive at an answer. We classified this behavior as a form of reward hacking against the outcome-based verifier. To discourage this, we assigned a reward of zero to any generated solution that contained executable code.
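A rough sketch of such a penalty is shown below. The detection criteria are not specified in the text, so the patterns here are purely hypothetical heuristics, and the names `contains_code` and `penalized_reward` are illustrative.

```python
import re

# Hypothetical heuristics for spotting Python code in a response.
_CODE_PATTERNS = [
    re.compile(r"`{3}"),                                    # fenced code block
    re.compile(r"^\s*import\s+\w+", re.MULTILINE),          # import statement
    re.compile(r"^\s*from\s+[\w.]+\s+import\b", re.MULTILINE),
    re.compile(r"^\s*def\s+\w+\s*\(", re.MULTILINE),        # function definition
    re.compile(r"\bprint\s*\("),                            # print call
]


def contains_code(response: str) -> bool:
    """Return True if any heuristic pattern matches the response."""
    return any(p.search(response) for p in _CODE_PATTERNS)


def penalized_reward(response: str, reward: float) -> float:
    """Zero out the reward of any response flagged as containing code."""
    return 0.0 if contains_code(response) else reward
```

Pattern-based detection of this kind can produce false positives and false negatives, so the list would need to be tuned against real model outputs.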
