# Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning

**Authors:** Ang Li, Zhihang Yuan, Yang Zhang, Shouda Liu, Yisen Wang

arXiv: 2509.00125 · 2025-09-03

## TL;DR

This paper introduces DACE, a reinforcement learning algorithm for LLMs that uses self-assessed certainty to adaptively balance exploration and exploitation, improving reasoning performance on challenging benchmarks.

## Contribution

DACE leverages LLMs' self-certainty as a dynamic signal to guide exploration, addressing the limitations of outcome-based rewards in reinforcement learning.

## Key findings

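- DACE significantly outperforms strong baselines on challenging mathematical reasoning benchmarks (AIME, MATH).
- Models trained with DACE achieve higher accuracy and more robust performance when test-time compute is scaled.
- Difficulty-aware modulation of the certainty reward encourages exploration on hard tasks and learning efficiency on easy ones, without sacrificing precision.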

## Abstract

Reinforcement Learning with Verifiable Feedback (RLVF) has become a key technique for enhancing the reasoning abilities of Large Language Models (LLMs). However, its reliance on sparse, outcome-based rewards, which indicate only whether a final answer is correct, fails to provide granular guidance on the reasoning process itself. This limitation hinders efficient learning, as the model cannot distinguish between high-quality and inefficient solutions, nor can it learn effectively from different types of failures. To address this, we observe that an LLM's self-certainty often correlates with task difficulty and solution quality. We introduce Difficulty-Aware Certainty-guided Exploration (DACE), a novel RL algorithm that leverages this insight to dynamically balance the exploration-exploitation trade-off. DACE assesses task difficulty online based on the policy's success rate. It then uses this signal to modulate an intrinsic reward: for difficult tasks where the model is struggling, DACE encourages exploration by penalizing high certainty; for easier tasks, it encourages learning efficiency by rewarding high certainty. Experiments on challenging mathematical reasoning benchmarks (AIME, MATH) show that DACE significantly outperforms strong baselines. The DACE-trained models not only achieve higher accuracy but also demonstrate more robust performance when scaling test-time compute, validating that our adaptive approach fosters effective exploration without sacrificing precision.
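
The reward modulation described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the function name `dace_style_rewards`, the fixed `threshold`, the weight `beta`, and the use of a single scalar certainty score per response are all assumptions made for illustration. The sketch only reflects the stated idea that difficulty is estimated from the policy's success rate on a prompt, and that the sign of a certainty-based intrinsic reward flips depending on that difficulty.

```python
from typing import List


def dace_style_rewards(
    correct: List[bool],
    certainty: List[float],
    threshold: float = 0.5,  # hypothetical difficulty cutoff, not from the paper
    beta: float = 0.1,       # hypothetical weight on the intrinsic term
) -> List[float]:
    """Combine outcome rewards with a difficulty-dependent certainty term.

    `correct[i]` is the verifier outcome for the i-th sampled response to one
    prompt; `certainty[i]` is that response's self-certainty score (assumed
    here to be a precomputed scalar).
    """
    if not correct or len(correct) != len(certainty):
        raise ValueError("correct and certainty must be non-empty and equal in length")

    # Online difficulty estimate: one minus the policy's success rate on this prompt.
    success_rate = sum(correct) / len(correct)
    difficulty = 1.0 - success_rate

    # Hard prompt: penalize high certainty (encourage exploration).
    # Easy prompt: reward high certainty (encourage efficient solutions).
    sign = -1.0 if difficulty > threshold else 1.0

    return [float(c) + beta * sign * s for c, s in zip(correct, certainty)]


# Usage: one hard prompt (1 of 4 correct) and one easy prompt (3 of 4 correct).
hard = dace_style_rewards([False, False, False, True], [0.9, 0.2, 0.5, 0.8])
easy = dace_style_rewards([True, True, True, False], [0.9, 0.2, 0.5, 0.8])
```

With the same certainty scores, the most confident response receives the largest penalty on the hard prompt and the largest bonus on the easy one. A hard threshold is the simplest choice for a sketch; the paper may well use a smoother or differently scaled modulation.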

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00125/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00125/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/2509.00125/full.md

---
Source: https://tomesphere.com/paper/2509.00125