Verified Critical Step Optimization for LLM Agents
Mukai Li, Qingcheng Zeng, Tianqing Fang, Zhenwen Liang, Linfeng Song, Qi Liu, Haitao Mi, Dong Yu

TL;DR
The paper introduces Critical Step Optimization (CSO), a post-training method for large language model agents that focuses preference learning on verified critical decision points, yielding significant performance gains while supervising only a small fraction of trajectory steps.
Contribution
CSO is a novel approach that targets critical decision steps using verified supervision, starting from failed trajectories and leveraging expert models for high-quality alternatives.
Findings
CSO achieves 37% and 26% relative improvements on GAIA-Text-103 and XBench-DeepSearch, respectively.
It outperforms other post-training methods while supervising only 16% of trajectory steps.
The method enhances policy robustness by focusing on verifiable critical decisions.
Abstract
As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundamental challenges: outcome-only rewards fail to precisely attribute credit to intermediate steps, estimated step-level rewards introduce systematic noise, and Monte Carlo sampling approaches for step reward estimation incur prohibitive computational cost. Inspired by findings that only a small fraction of high-entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps, decision points where alternate actions demonstrably flip task outcomes from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model's weaknesses. We use a process reward model (PRM) to…
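The notion of a verified critical step described above can be illustrated with a minimal sketch. It is not the authors' implementation: `PreferencePair`, `collect_verified_pairs`, `propose_alternative`, and `rollout_succeeds` are hypothetical names, trajectories are reduced to lists of action strings, and the sketch tries every step exhaustively, which the actual method may well avoid. The only idea taken from the abstract is that a step in a failed trajectory yields a preference pair when an alternative action at that step demonstrably turns failure into success.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    step_index: int
    chosen: str    # alternative action whose continuation succeeded
    rejected: str  # original action taken in the failed trajectory


def collect_verified_pairs(
    failed_actions: Sequence[str],
    propose_alternative: Callable[[Sequence[str], str], str],
    rollout_succeeds: Callable[[Sequence[str]], bool],
) -> List[PreferencePair]:
    """Keep only steps where swapping the action flips failure to success.

    Both callables are hypothetical stand-ins: one for an expert model that
    suggests an alternative action, one for continuing the episode from the
    modified prefix and checking the final outcome.
    """
    pairs: List[PreferencePair] = []
    for i, original in enumerate(failed_actions):
        prefix = list(failed_actions[:i])
        alternative = propose_alternative(prefix, original)
        if alternative == original:
            continue  # no different action proposed, nothing to compare
        # Verification: the pair is kept only if the outcome really flips.
        if rollout_succeeds(prefix + [alternative]):
            pairs.append(PreferencePair(i, alternative, original))
    return pairs


# Toy environment (assumption): the task succeeds only if the right page is opened.
failed = ["search", "open_wrong_page", "answer"]


def toy_expert(prefix: Sequence[str], original: str) -> str:
    return "open_right_page" if original == "open_wrong_page" else original


def toy_rollout(actions: Sequence[str]) -> bool:
    return "open_right_page" in actions


pairs = collect_verified_pairs(failed, toy_expert, toy_rollout)
print(pairs)
```

In this toy run only one of the three steps produces a pair, which mirrors the idea of supervising a small subset of steps. The resulting pairs could then feed any preference-learning objective; which objective is used is not specified here.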
