StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

TL;DR
This paper introduces StepPO, a step-level agentic reinforcement learning framework that shifts the focus from token-level to step-level decision-making, aiming to better capture and optimize LLM agent behavior.
Contribution
It proposes a novel step-level MDP formulation and credit assignment method, advancing the granularity of policy optimization for agentic RL with LLMs.
Findings
Preliminary experiments show promise for step-level optimization.
StepPO aligns policy updates more closely with agent decision points.
The approach addresses challenges of delayed and sparse rewards in multi-turn settings.
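To make the step-level idea concrete, the sketch below shows one generic way to assign credit per agent step instead of per token under a delayed, sparse reward. It is a minimal illustration, not StepPO's actual algorithm: the summary above does not give the paper's MDP formulation or credit assignment rule, so the discounted-return scheme, the function names `step_returns` and `broadcast_to_tokens`, and the `gamma` parameter are all assumptions made for this example.

```python
from typing import List


def step_returns(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go for each agent step (hypothetical helper).

    step_rewards[t] is the reward observed after step t; in a sparse-reward
    episode most entries are 0 and only the final step carries a signal.
    """
    returns = [0.0] * len(step_rewards)
    running = 0.0
    # Walk backwards so each step accumulates the reward of all later steps.
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


def broadcast_to_tokens(step_values: List[float], tokens_per_step: List[int]) -> List[float]:
    """Give every token in a step the same step-level value (assumed scheme)."""
    if len(step_values) != len(tokens_per_step):
        raise ValueError("need exactly one value per step")
    token_values: List[float] = []
    for value, n_tokens in zip(step_values, tokens_per_step):
        token_values.extend([value] * n_tokens)
    return token_values


# Three-step episode: reward arrives only at the end (delayed and sparse).
rewards = [0.0, 0.0, 1.0]
per_step = step_returns(rewards, gamma=0.5)
per_token = broadcast_to_tokens(per_step, tokens_per_step=[2, 1, 3])
print(per_step)   # [0.25, 0.5, 1.0]
print(per_token)  # [0.25, 0.25, 0.5, 1.0, 1.0, 1.0]
```

In this sketch the unit of credit is the step: all tokens emitted within one decision share a single value, so the optimization signal lines up with the agent's decision points rather than with individual tokens.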
Abstract
General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly strong agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is…
