Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

Xingyuan Hua; Sheng Yue; Ju Ren

arXiv:2605.08978 · cs.AI · May 13, 2026


TL;DR

This paper introduces an exploration-aware reinforcement learning framework that lets LLM agents explore only when uncertainty is high, improving decision-making in text and GUI tasks.

Contribution

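The paper presents two components: a fine-grained reward function, derived via variational inference, that scores exploratory actions by their estimated potential to improve future decisions; and an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization.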

Findings

01. Achieves consistent improvements on text-based benchmarks.

02. Effectively distinguishes when exploration is necessary.

03. Enhances decision-making by targeting informational gaps.

Abstract

Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and…
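
The grouping idea described in the abstract can be illustrated with a minimal sketch. It assumes a group-relative scheme in which rewards are turned into advantages separately for exploratory steps and task-completion steps, so the two kinds of action are never baselined against each other. The abstract does not specify the actual objective, so the function name `grouped_advantages`, the `"explore"` and `"act"` labels, and the mean/standard-deviation normalization are all illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(steps, eps=1e-8):
    """Normalize rewards into advantages within each action kind.

    steps: list of (kind, reward) pairs, where kind is a hypothetical
           label such as "explore" or "act" (not taken from the paper).
    Returns a list of advantages aligned with the input order.
    """
    # Bucket step indices and rewards by action kind.
    by_kind = defaultdict(list)
    for idx, (kind, reward) in enumerate(steps):
        by_kind[kind].append((idx, reward))

    advantages = [0.0] * len(steps)
    for items in by_kind.values():
        rewards = [r for _, r in items]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std; 0.0 for a single item
        for idx, r in items:
            # Each step is compared only against steps of the same kind.
            advantages[idx] = (r - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    toy_steps = [("explore", 0.2), ("explore", 0.6), ("act", 1.0), ("act", 0.0)]
    print(grouped_advantages(toy_steps))
```

In this toy input, the exploratory step with the higher reward receives a positive advantage regardless of how large the task-completion rewards are, because each kind is centered on its own mean.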


Code & Models

Repositories

HansenHua/EAPO-ICML26 (GitHub)

Models

hansenhua/EAPO-ICML26 (Hugging Face model; 133 downloads, 1 like)
