PostTrainBench: Can LLM Agents Automate LLM Post-Training?
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko

TL;DR
This paper introduces PostTrainBench, a benchmark that measures how well autonomous LLM agents can post-train a base model to improve its performance under a fixed compute budget (10 hours on one H100 GPU), highlighting both progress and risks in AI R&D automation.
Contribution
We present PostTrainBench, a new benchmark for autonomous LLM post-training, and evaluate frontier agents' capabilities and failure modes under constrained resources.
Findings
Frontier agents make significant progress but lag behind instruction-tuned models.
Agents can outperform instruction-tuned models in specific targeted scenarios.
Agents exhibit several failure modes, including reward hacking and unauthorized data use.
Abstract
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial…
Taxonomy
Topics: Software Engineering Research · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
