DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

Junshuo Zhang; Chengrui Huang; Feng Guo; Zihan Li; Ke Shi; Menghua Jiang; Jiguo Yu; Shuo Shang; Shen Gao

arXiv:2604.24320·cs.CL·April 28, 2026

TL;DR

DPEPO is an RL algorithm for LLM agents that lets one agent explore multiple environments in parallel along diverse trajectories, reaching state-of-the-art success rates on ALFWorld and ScienceWorld.

Contribution

The paper proposes DPEPO, a reinforcement learning method that encourages diverse parallel exploration in LLM agents, enhancing environmental understanding and performance.

Findings

01 DPEPO achieves state-of-the-art success rates on ALFWorld and ScienceWorld.

02 It maintains efficiency comparable to strong sequential baselines.

03 The hierarchical reward scheme effectively promotes exploration diversity.

Abstract

Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. DPEPO has two stages: an initial supervised fine-tuning (SFT) stage that imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two…
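The parallel-interaction idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyEnv`, `SharedExperience`, `parallel_policy`, and `run_episode` are hypothetical names, the rule-based policy stands in for the LLM, and the room-search task is invented purely for illustration. It shows only how acting in several environment copies per step, while pooling what each trajectory observes, can shorten a search compared with a single sequential trajectory.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical toy environment: the goal is to enter the target room."""
    target: int
    n_rooms: int = 4
    done: bool = False

    def step(self, room: int) -> tuple[str, bool]:
        # Entering the target room ends the episode successfully.
        self.done = room == self.target
        obs = f"room {room}: " + ("goal found" if self.done else "empty")
        return obs, self.done


@dataclass
class SharedExperience:
    """Observations pooled across all parallel trajectories (assumed design)."""
    visited_empty: set[int] = field(default_factory=set)


def parallel_policy(k: int, n_rooms: int, shared: SharedExperience) -> list[int]:
    """Stand-in for the LLM: choose up to k distinct rooms not known to be empty."""
    candidates = [r for r in range(n_rooms) if r not in shared.visited_empty]
    return candidates[:k]


def run_episode(target: int, k: int = 2, n_rooms: int = 4, max_steps: int = 5):
    """Return the step at which any trajectory succeeds, or None on failure."""
    envs = [ToyEnv(target, n_rooms) for _ in range(k)]  # k copies of one task
    shared = SharedExperience()
    for step in range(1, max_steps + 1):
        # One decision per step covers all k environments, with diverse actions.
        actions = parallel_policy(k, n_rooms, shared)
        for env, room in zip(envs, actions):
            _, done = env.step(room)
            if done:
                return step
            # A failure seen in one trajectory informs every other trajectory.
            shared.visited_empty.add(room)
    return None


if __name__ == "__main__":
    print("sequential (k=1):", run_episode(target=3, k=1))
    print("parallel   (k=2):", run_episode(target=3, k=2))
```

In this toy setting the parallel agent needs fewer decision steps because each step rules out several rooms at once. The paper's reward scheme is not modelled here.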

Code & Models

Repositories

LePanda026/Code-for-DPEPO (GitHub)