Agentic Reinforced Policy Optimization
Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou

TL;DR
ARPO is a reinforcement learning algorithm for training multi-turn LLM-based agents. It balances long-horizon reasoning with multi-turn tool interactions and outperforms trajectory-level RL algorithms while using a smaller tool-use budget.
Contribution
ARPO introduces an entropy-based adaptive rollout mechanism and advantage attribution estimation, which together improve exploration and help LLM agents internalize stepwise tool-use advantages.
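One way to picture advantage attribution is sketched below. It is an illustration only, not the authors' implementation: it assumes GRPO-style group-normalized rewards, and it assumes that tokens in a prefix shared by several branched trajectories receive the mean advantage of those trajectories while the remaining tokens keep their own trajectory's advantage. The names `group_advantages`, `attribute_advantages`, `branch`, and `shared_len` are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a rollout group (GRPO-style; an assumption here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def attribute_advantages(rewards, lengths, branch, shared_len):
    """Return one list of per-token advantages for each trajectory.

    rewards    -- final scalar reward of each trajectory
    lengths    -- number of generated tokens in each trajectory
    branch     -- indices of trajectories that share a common prefix
    shared_len -- number of tokens in that shared prefix
    """
    adv = group_advantages(rewards)
    # Hypothetical rule: shared tokens get the mean advantage of the branch.
    shared_adv = mean([adv[i] for i in branch])
    per_token = []
    for i, (a, n) in enumerate(zip(adv, lengths)):
        if i in branch:
            per_token.append([shared_adv] * shared_len + [a] * (n - shared_len))
        else:
            per_token.append([a] * n)
    return per_token

# Four rollouts; the first two branched from the same two-token prefix.
out = attribute_advantages(
    rewards=[1.0, 0.0, 0.0, 1.0],
    lengths=[4, 4, 3, 3],
    branch={0, 1},
    shared_len=2,
)
```

In this toy group the two branched rollouts have opposite outcomes, so the shared prefix receives an advantage near zero and only the diverging tokens carry a learning signal.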
Findings
ARPO outperforms trajectory-level RL algorithms on 13 benchmarks.
It achieves similar or better results with half the tool-use budget.
ARPO enhances exploration in high-uncertainty tool interaction steps.
Abstract
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO…
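The uncertainty signal mentioned in the abstract is token-level entropy. The sketch below shows one common way to compute it from a model's raw logits, using the standard Shannon entropy of the softmax distribution. The helper names `token_entropy` and `mean_entropy` are illustrative, and the paper's exact measurement window is not specified in this summary.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(step_logits):
    """Average token entropy over a window of positions (window choice is an assumption)."""
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)

uniform = token_entropy([0.0, 0.0, 0.0, 0.0])   # maximally uncertain over 4 tokens
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])   # almost all mass on one token
```

A uniform distribution over four tokens gives the maximum entropy, log 4, while a sharply peaked one gives a value close to zero. Comparing such values before and after a tool response is what would reveal the entropy increase the abstract describes.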
Peer Reviews
Decision: ICLR 2026 Poster
1. Strong Motivation and Insight – The identification of entropy spikes following tool use provides a concrete empirical basis for adaptive exploration and directly informs ARPO’s design.
2. Innovative Algorithmic Mechanism – The entropy-based adaptive rollout efficiently balances global and partial sampling, encouraging diverse tool-use behaviors without excessive sampling.
3. Refined Credit Assignment – The Advantage Attribution Estimation effectively separates shared versus individual token u
1. Limited Domain Generalization – All experiments are text-based; no multimodal or embodied environments are tested, restricting claims of general agentic applicability.
2. Hyperparameter Sensitivity – The algorithm depends on key parameters (entropy threshold τ, stability factor β), but no sensitivity or ablation study is provided.
3. Lack of Runtime Validation – While ARPO claims reduced rollout complexity (O(n log n)), there is no empirical runtime or resource cost analysis to verify this.
4
- The evaluation covers a wide range of tasks (reasoning, knowledge retrieval, deep search), showing consistent improvements over multiple baselines (GRPO, DAPO, Reinforce++).
- ARPO achieves better or comparable results with roughly half the tool-call budget, which is practically valuable given the high cost of tool-based RL.
- The GPG theorem provides a solid conceptual framework connecting macro-step rollouts to standard policy gradients, offering a theoretical foundation.
- The paper is well
- The paper’s central motivation that token entropy reliably increases after tool calls is not convincingly demonstrated. Figures 1 and 2 appear anecdotal, with unclear statistical support, and may rely on a single sample.
- While ARPO’s adaptive rollout is designed to mitigate high-entropy uncertainty, the paper does not show post-training token entropy patterns to demonstrate that the method actually reduces or stabilizes entropy. Without before-and-after comparisons, it’s unclear whether ARPO
- Paper is well motivated and the observation of high entropy tool calling steps is clearly presented.
- Consistent improvements in evaluated domains (math, coding, deep search).
- Mathematical formulation is grounded and highlights the efficacy of their sampling approach.
- Limited Novelty: The main contribution seems to be the additional sampling introduced by branching from high entropy actions. The learning objective is simply multi-turn GRPO given rollouts generated by their sampling mechanism.
- The effect of branching sampling ($P_{t}$) is unclear. There are no ablations documenting the effects of relying on the base probability ($\alpha$) versus relying on the entropy differential ($\beta$) or the branching cutoff ($\tau$).
- Unclear if the effectiveness of
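To make the reviewers' symbols concrete, the sketch below shows how a base probability $\alpha$, an entropy-differential weight $\beta$, and a cutoff $\tau$ could combine into a branching decision. The linear form $P_t = \alpha + \beta \cdot \Delta H_t$ and the unnormalized entropy difference are assumptions made for illustration; this summary does not state the paper's exact formula, and the function names are hypothetical.

```python
def branch_probability(h_initial, h_step, alpha, beta):
    """Assumed linear form: P_t = alpha + beta * (H_t - H_initial)."""
    delta_h = h_step - h_initial  # entropy change after the tool call
    return alpha + beta * delta_h

def should_branch(h_initial, h_step, alpha, beta, tau):
    """Branch extra partial rollouts only when P_t exceeds the cutoff tau."""
    return branch_probability(h_initial, h_step, alpha, beta) > tau

# Hypothetical values: entropy rises from 1.0 to 2.0 nats after a tool response.
spike = should_branch(h_initial=1.0, h_step=2.0, alpha=0.2, beta=0.4, tau=0.5)
flat = should_branch(h_initial=1.0, h_step=1.0, alpha=0.2, beta=0.4, tau=0.5)
```

Under these assumed values the entropy spike triggers branching and the flat step does not. An ablation of the kind the reviewers request would vary `alpha`, `beta`, and `tau` independently and report how often branching fires and how final accuracy changes.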
Code & Models
- 🤗 dongguanting/Qwen3-8B-ARPO-DeepSearch · model · 90 dl · ♡ 2
- 🤗 dongguanting/Qwen3-14B-ARPO-DeepSearch · model · 8 dl · ♡ 5
- 🤗 dongguanting/Qwen2.5-7B-ARPO · model · 324 dl · ♡ 2
- 🤗 dongguanting/Qwen2.5-3B-ARPO · model · 6 dl · ♡ 3
- 🤗 dongguanting/Llama3.1-8B-ARPO · model · 1 dl · ♡ 1
- 🤗 dongguanting/QwQ-32B-ARPO-DeepSearch · model · 6 dl · ♡ 1
