Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Wanyi Chen; Xiao Yang; Xu Yang; Tianming Sha; Qizheng Li; Zhuo Wang; Bowen Xian; Fang Kong; Weiqing Liu; Jiang Bian

arXiv:2604.10547·cs.AI·May 14, 2026

TL;DR

Agent^2 RL-Bench is a new diagnostic benchmark that evaluates whether LLM agents can autonomously design, implement, and improve RL post-training pipelines, and it highlights their current capabilities and limitations.

Contribution

It introduces a unified interface and diverse tasks for assessing agentic RL post-training, along with diagnostic skills for structured analysis.

Findings

01

Agents can sometimes engineer online RL pipelines that improve model performance on tasks such as ALFWorld.

02

Supervised pipelines remain dominant in successful agent strategies.

03

Outcomes differ widely across agent stacks, so results depend heavily on which stack is used.

Abstract

We introduce Agent^2 RL-Bench, a compact diagnostic benchmark for evaluating agentic RL post-training, which tests whether LLM agents can autonomously design, implement, debug, and execute post-training pipelines that improve foundation models. RL post-training increasingly drives model alignment and specialization, yet existing benchmarks are largely static, rewarding supervised fine-tuning or script generation without assessing an agent's ability to close an interactive RL loop. Agent^2 RL-Bench provides a unified agent-facing interface: each run starts from an isolated workspace containing a base model, task data, instructions, and a grading API, and agents must iterate within a fixed budget by training models and submitting artifacts for evaluation. The benchmark spans six tasks across three levels, from static rule-based training to judge-based optimization and closed-loop online RL…
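The interface described in the abstract (a workspace, a grading API, and a fixed iteration budget) can be pictured as a simple budgeted loop. The sketch below is only an illustration under stated assumptions: `Workspace`, `RunLog`, `run_agent`, `toy_train_step`, and `toy_grade` are hypothetical names invented here, the budget is assumed to be counted in submissions, and none of this reflects the benchmark's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Workspace:
    """Hypothetical stand-in for the isolated per-run workspace."""
    base_model: str
    task_data: str
    instructions: str


@dataclass
class RunLog:
    """Scores returned by the grader, plus the best artifact seen so far."""
    scores: List[float] = field(default_factory=list)
    best_score: Optional[float] = None
    best_artifact: Optional[str] = None


def run_agent(
    workspace: Workspace,
    train_step: Callable[[Workspace, str, int], str],
    grade: Callable[[str], float],
    budget: int,
) -> RunLog:
    """Train, submit, and record the score until the budget is spent.

    Assumption: the budget is a number of train-and-submit rounds. The
    real benchmark may measure budget differently (for example, time).
    """
    log = RunLog()
    artifact = workspace.base_model
    for step in range(budget):
        artifact = train_step(workspace, artifact, step)  # produce a new artifact
        score = grade(artifact)                           # submit it for evaluation
        log.scores.append(score)
        if log.best_score is None or score > log.best_score:
            log.best_score, log.best_artifact = score, artifact
    return log


def toy_train_step(workspace: Workspace, artifact: str, step: int) -> str:
    """Toy 'training': tag the artifact name with the iteration index."""
    return f"{artifact}+iter{step}"


def toy_grade(artifact: str) -> float:
    """Toy grader: score grows with the number of iterations applied."""
    return artifact.count("+iter") / 10
```

In this toy setup, calling `run_agent(Workspace("base", "data", "instr"), toy_train_step, toy_grade, budget=3)` performs three train-and-submit rounds and keeps the highest-scoring artifact. A real agent would replace the two toy callables with actual training code and calls to the grading API.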

Code & Models

Repositories

microsoft/RD-Agent/blob/main/rdagent/scenarios/rl/autorl_bench/README.md
github
