CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Zhen Zhang; Kaiqiang Song; Xun Wang; Yebowen Hu; Weixiang Yan; Chenyang Zhao; Henry Peng Zou; Haoyun Deng; Sathish Reddy Indurthi; Shujian Liu; Simin Ma; Xiaoyang Wang; Xin Eric Wang; Song Wang

arXiv:2602.12268 · cs.AI · February 24, 2026

Open Access

TL;DR

CM2 is a reinforcement learning framework that trains agents for multi-turn, multi-step tool use with checklist rewards, which rely on explicit criteria and evidence grounding instead of verifiable outcome rewards.

Contribution

The paper proposes a scalable RL method that replaces verifiable outcome rewards with structured checklist rewards, so that multi-turn tool-using agents can be trained without heavy environment engineering.

Findings

01. CM2 outperforms supervised fine-tuning on multiple benchmarks.

02. Training in a simulated environment reduces engineering costs.

03. Results match or surpass similar-sized open-source baselines.

Abstract

AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but…
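
To make the idea of a checklist reward concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical `Criterion` record with a description, an evidence string, a binary verdict, and free-form metadata, and it assumes the per-turn reward is simply the fraction of criteria that are both satisfied and grounded in evidence. The abstract does not specify the aggregation rule or the schema, so the names `Criterion` and `turn_reward`, the field layout, and the example checklist are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One binary checklist item for a single turn (hypothetical schema)."""
    description: str                 # what the turn is expected to do
    evidence: str                    # span of the trajectory the judge points to ("" if none)
    satisfied: bool                  # binary verdict from the judge
    metadata: dict = field(default_factory=dict)  # assumed free-form structured metadata


def turn_reward(criteria):
    """Fraction of criteria that are satisfied and have evidence (assumed aggregation)."""
    if not criteria:
        return 0.0
    # A criterion counts only if it is marked satisfied AND cites non-empty evidence.
    passed = sum(1 for c in criteria if c.satisfied and c.evidence)
    return passed / len(criteria)


# Illustrative checklist for one turn of a made-up flight-search dialogue.
checklist = [
    Criterion("Calls the search tool before answering", "tool_call: search(...)", True),
    Criterion("Asks the user to confirm the date", "", False),
    Criterion("Reports the result to the user", "assistant: found 3 flights", True),
    Criterion("Avoids redundant tool calls", "exactly one search call", True),
]

reward = turn_reward(checklist)
print(reward)
```

Requiring evidence alongside each verdict is one simple way to reflect the "explicit evidence grounding" the abstract mentions: a criterion marked satisfied without a supporting span contributes nothing to the score.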

Taxonomy

Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Machine Learning and Data Classification