TL;DR
Agent2 RL-Bench is a new diagnostic benchmark for evaluating whether LLM agents can autonomously design, implement, and improve RL post-training pipelines, highlighting current capabilities and limitations.
Contribution
It introduces a unified interface and diverse tasks for assessing agentic RL post-training, along with diagnostic skills for structured analysis.
Findings
Agents can sometimes engineer online RL pipelines that improve model performance on tasks such as ALFWorld.
Supervised pipelines remain dominant in successful agent strategies.
Outcomes differ widely across agent stacks, so results depend heavily on which stack is used.
Abstract
We introduce Agent2 RL-Bench, a compact diagnostic benchmark for evaluating agentic RL post-training, which tests whether LLM agents can autonomously design, implement, debug, and execute post-training pipelines that improve foundation models. RL post-training increasingly drives model alignment and specialization, yet existing benchmarks are largely static, rewarding supervised fine-tuning or script generation without assessing an agent's ability to close an interactive RL loop. Agent2 RL-Bench provides a unified agent-facing interface: each run starts from an isolated workspace containing a base model, task data, instructions, and a grading API, and agents must iterate within a fixed budget by training models and submitting artifacts for evaluation. The benchmark spans six tasks across three levels, from static rule-based training to judge-based optimization and closed-loop online RL…
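The run protocol described in the abstract (an isolated workspace, a fixed budget, and repeated train-then-submit rounds against a grading API) can be pictured with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: `Workspace`, `RunLog`, `run_episode`, and the toy training and grading functions are not the benchmark's actual interface, and the budget is modeled as a simple count of submissions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Workspace:
    """Hypothetical stand-in for the isolated per-run workspace."""
    base_model: str        # identifier of the starting model
    task_data: List[str]   # task examples available to the agent
    instructions: str      # natural-language task description


@dataclass
class RunLog:
    """Records every graded submission and the best one seen."""
    scores: List[float] = field(default_factory=list)
    best_artifact: Optional[str] = None
    best_score: float = float("-inf")


def run_episode(
    ws: Workspace,
    train_step: Callable[[str, Workspace], str],
    grade: Callable[[str], float],
    budget: int,
) -> RunLog:
    """Iterate train -> submit -> grade until the budget is spent."""
    log = RunLog()
    artifact = ws.base_model
    for _ in range(budget):
        artifact = train_step(artifact, ws)  # agent produces a new artifact
        score = grade(artifact)              # assumed grading API call
        log.scores.append(score)
        if score > log.best_score:
            log.best_score, log.best_artifact = score, artifact
    return log


# Toy stand-ins (assumptions): each "training step" tags the artifact,
# and the grader rewards the number of steps, capped at 1.0.
def toy_train_step(artifact: str, ws: Workspace) -> str:
    return artifact + "+step"


def toy_grade(artifact: str) -> float:
    return min(1.0, 0.25 * artifact.count("+step"))


ws = Workspace("base", ["example"], "Improve the model.")
log = run_episode(ws, toy_train_step, toy_grade, budget=3)
print(log.scores, log.best_artifact)
```

In a real run the training step would be whatever pipeline the agent designs, whether supervised or online RL, and the grader would be the benchmark's own evaluation service; the sketch only shows the control flow of iterating under a budget.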
