TL;DR
APTBench is a new benchmark designed to evaluate the agentic capabilities of base large language models during pre-training, focusing on planning and action skills relevant to real-world autonomous tasks.
Contribution
The paper introduces APTBench, a lightweight and cost-effective benchmark that assesses agentic potential during pre-training, bridging the gap between static skill evaluation and post-training agent benchmarks.
Findings
APTBench effectively predicts downstream agent performance.
It covers key agent scenarios like software engineering and research.
The benchmark is lightweight and cost-efficient.
Abstract
With the rapid development of LLM-based agents, there is a growing trend to incorporate agent-specific data into the pre-training stage of LLMs, aiming to better align LLMs with real-world autonomous task execution. However, current pre-training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect a model's agentic capabilities. On the other hand, agent benchmarks are typically designed for post-trained models, requiring multi-turn task execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potential during pre-training and guide model training more effectively. To address this gap, we propose APTBench, a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions…
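The conversion the abstract describes, from a successful trajectory to a single-turn multiple-choice question, can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Step` class, the `build_mcq` function, the prompt wording, and the example distractors are all hypothetical, and the abstract does not say how incorrect options are actually produced.

```python
import random
import string
from dataclasses import dataclass


@dataclass
class Step:
    observation: str  # what the agent saw before acting (hypothetical field)
    action: str       # the action taken in the successful trajectory


def build_mcq(task, steps, index, distractors, seed=0):
    """Turn step `index` of a successful trajectory into one MCQ item.

    The earlier steps become the context, the recorded action becomes the
    correct option, and `distractors` supply the incorrect options.
    """
    if not 0 <= index < len(steps):
        raise IndexError("index is outside the trajectory")
    correct = steps[index].action
    # Drop any distractor that happens to equal the correct action.
    options = [correct] + [d for d in distractors if d != correct]
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("too many options to label with A-Z")
    random.Random(seed).shuffle(options)  # seeded so items are reproducible

    lines = [f"Task: {task}"]
    for step in steps[:index]:
        lines.append(f"Observation: {step.observation}")
        lines.append(f"Action: {step.action}")
    lines.append(f"Observation: {steps[index].observation}")
    lines.append("Which action should come next?")

    labels = string.ascii_uppercase
    choices = {labels[i]: option for i, option in enumerate(options)}
    answer = labels[options.index(correct)]
    return {"prompt": "\n".join(lines), "choices": choices, "answer": answer}


# Usage with a made-up two-step trajectory.
trajectory = [
    Step("repo cloned", "cat README.md"),
    Step("README mentions requirements.txt", "pip install -r requirements.txt"),
]
item = build_mcq(
    "Set up the project environment",
    trajectory,
    index=1,
    distractors=["make install", "npm install"],
)
print(item["prompt"])
print(item["choices"], "answer:", item["answer"])
```

Because each item is a single prompt with a fixed answer key, a base model can be scored by comparing option likelihoods or a short completion, without multi-turn execution.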
Peer Reviews
Decision·Submitted to ICLR 2026
Significant Problem & Interesting Entry Point: The paper correctly identifies a critical gap: the need to evaluate base models for agentic skills before the costly post-training stage. Its core technical idea—converting interactive trajectories into static, single-turn formats (MCQ/TC)—is a clever and pragmatic entry point to solving this challenging problem. Useful Base Model Analysis: The experimental results, regardless of the benchmark's validation, provide valuable insights for the communi
Fundamentally Unjustified Taxonomy and Thin Validation: This is the paper's primary methodological failure. For a paper whose topic is proposing a new benchmark, its main method and focus should be the rigorous selection and validation of what is being measured. This critical part is almost entirely omitted. The paper simply proposes a taxonomy ("Planning," "Action," "Atomic Abilities") based on intuition, with no formal analysis, theoretical grounding, or ablation studies to prove these metrics
- The paper tackles an important problem of assessing base model capabilities for agentic purposes and real-world coding scenarios. Many post-training benchmarks exist, such as SWE-Bench, MLE-Bench, and Terminal-Bench, but they can only be evaluated on post-trained models due to their multi-turn and iterative setup. No such benchmarks exist for pre-trained models.
- Instead of just a benchmark, the authors detail the generation pipeline for building pre-training benchmarks for agentic capabil
- I might've missed this, but the paper does not clearly state which models are used to perturb the correct ground truth to generate the incorrect options for the LLMs.
- How do benchmark construction quality and correlation scores change when different models are used to generate the traces/MCQ options? That might be an important indicator for the usability of this benchmark for pre-training.
- EM and ROUGE scores are not reliable indicators of performance. I would like to se
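The reviewer's concern about EM and ROUGE on text-completion items can be illustrated with a small sketch. It assumes whitespace tokenization and an unweighted F1 variant of ROUGE-L; the function names and example commands are hypothetical, and the paper's actual scoring code is not shown in this excerpt.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """True only when the two strings agree after trimming whitespace."""
    return prediction.strip() == reference.strip()


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as an F1 over whitespace tokens (a simplified variant)."""
    p, r = prediction.split(), reference.split()
    lcs = lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


reference = "pip install -r requirements.txt"
equivalent = "pip3 install -r requirements.txt"   # plausibly acceptable
harmful = "pip uninstall -r requirements.txt"     # does the opposite

print(exact_match(equivalent, reference))  # False
print(rouge_l_f1(equivalent, reference))   # 0.75
print(rouge_l_f1(harmful, reference))      # 0.75
```

In this toy case EM rejects a command that is arguably fine, while ROUGE-L gives the same score to that command and to one that does the opposite, which is the kind of unreliability the review points to.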
- The paper addresses a clear, important, and timely problem. As the field moves to integrate agent-specific data into pre-training, there is a critical need for a benchmark to measure "agentic potential" before the costly post-training stage. The paper correctly identifies this significant gap.
- Instead of just creating another end-to-end agent task, the work proposes a novel and creative framework to convert existing, complex agent trajectories into a static, multiple-choice/text-completion f
- The main weakness is around experiment design and results. The experimental results are mixed and inconclusive (e.g., as seen in Table 2, LLAMA3.2-3B has 14 for planning on EnvSetup, but 30 for IssueFix, LLAMA4 has 56 on EnvSetup and 50 on IssueFix, both seem to be good and bad at planning depending on the split). This seems to suggest that the benchmark may be measuring domain-specific knowledge (e.g., knowledge of coding syntax, familiarity with research topics) rather than the intended gene
