Exploration-Driven Optimization for Test-Time Large Language Model Reasoning

Changhao Li; Yuchen Zhuang; Chenxiao Gao; Haotian Sun; Rushi Qiang; Chao Zhang; Bo Dai

arXiv:2605.09853·cs.LG·May 12, 2026

TL;DR

This paper introduces Exploration-Driven Optimization (EDO), which adds reward-biasing exploration objectives to RL-based post-training so that large language models sample more diverse solutions and make better use of inference-time computation.

Contribution

EDO extends reward-biasing exploration objectives into RL-based post-training, improving reasoning diversity and stability in large language models.

Findings

01 EDO improves solution diversity in LLM reasoning tasks.

02 EDO achieves accuracy gains of 1.0-1.3% on in-distribution benchmarks.

03 EDO provides an average improvement of 1.5% on out-of-distribution tasks.

Abstract

Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive…
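The tension described in the abstract can be illustrated with a toy calculation. The sketch below is not EDO and not the authors' code. It assumes a single problem with four hypothetical candidate answers, and it uses temperature scaling as a stand-in for the sharpening that RL-based post-training induces. The logits, the `CORRECT` index, and the helper names are all made up for illustration. When the correct answer is not the policy's most likely one, sharpening moves probability mass away from it, so the chance that at least one of n samples is correct (pass@n) drops.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature gives a sharper distribution."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pass_at_n(p_correct, n):
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** n


# Hypothetical logits over four candidate answers to one problem.
# Index 1 is the correct answer, but index 0 is the policy's mode.
LOGITS = [2.0, 1.0, 0.5, 0.0]
CORRECT = 1


def coverage(temperature, n):
    """pass@n for the toy problem at a given sampling temperature."""
    probs = softmax(LOGITS, temperature)
    return pass_at_n(probs[CORRECT], n)


flat = coverage(temperature=1.0, n=8)    # relatively flat distribution
sharp = coverage(temperature=0.25, n=8)  # sharpened distribution

print(f"pass@8, flat policy:  {flat:.3f}")
print(f"pass@8, sharp policy: {sharp:.3f}")
```

In this toy setting the flatter policy has the higher pass@8, which is the effect inference-time methods rely on. If the mode were the correct answer, sharpening would instead help single-sample accuracy, which is why the two goals pull in opposite directions.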
