Automated Capability Evaluation of Foundation Models

Arash Afkanpour; Omkar Dige; Fatemeh Tavakoli; Negin Baghbanzadeh; Farnaz Kohankhaki; Elham Dolatabadi

arXiv:2505.17228·cs.LG·October 13, 2025

4 Reviews

TL;DR

This paper presents ACE, an automated framework that uses active learning and frontier models to evaluate foundation models' capabilities more comprehensively and efficiently than traditional static benchmarks.

Contribution

ACE introduces a scalable, automated, and fine-grained evaluation method that leverages semantic decomposition and active learning to assess foundation models' capabilities with minimal human effort.

Findings

01

ACE generated 433 capabilities and 11,800 tasks in Mathematics.

02

ACE came within 0.01 RMSE of exhaustive evaluation while testing fewer than half of the capabilities.

03

ACE uncovers fine-grained differences and provides a more complete capability profile.
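
For reference, the RMSE figure in finding 02 can be read against the standard definition of root-mean-square error. The notation below is an assumption made for illustration, not taken from the paper: $f(c_i)$ is the subject model's measured score on capability $c_i$ under exhaustive evaluation, $\hat{f}(c_i)$ is the score predicted by the capability model, and $N$ is the number of capabilities the error is averaged over.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl( \hat{f}(c_i) - f(c_i) \bigr)^{2}}
```

Under this reading, an RMSE of 0.01 means the predicted scores deviate from the exhaustively measured ones by about one percentage point in the root-mean-square sense, assuming scores lie in $[0, 1]$.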

Abstract

Current evaluation frameworks for foundation models rely heavily on static, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful frontier models to decompose a domain into semantically meaningful capabilities and generates diverse evaluation tasks, significantly reducing human effort. In Mathematics, ACE generated 433 capabilities and 11,800 tasks, covering 94% of Wikipedia-defined skills in the domain while introducing novel, coherent ones. To maximize efficiency, ACE fits a capability model in latent semantic space, allowing reliable approximation of a subject model's performance by evaluating only a subset of capabilities via…
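
The idea of fitting a capability model in a latent space and evaluating only a subset of capabilities can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the embeddings are random 2-D points, the "true" scores come from a synthetic function, the capability model is a plain Gaussian process with a squared-exponential kernel, and the acquisition rule is "evaluate the capability with the highest posterior variance". The names `rbf_kernel`, `gp_posterior`, and `active_evaluate` are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-4, length_scale=1.0):
    """Posterior mean and variance of a GP with a constant (empirical) mean."""
    y_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train, length_scale)
    mean = y_mean + k_qt @ np.linalg.solve(k_tt, y_train - y_mean)
    v = np.linalg.solve(k_tt, k_qt.T)
    var = 1.0 - np.sum(k_qt * v.T, axis=1)  # prior variance of the kernel is 1
    return mean, np.maximum(var, 0.0)


def active_evaluate(embeddings, score_fn, budget, n_init=5, seed=0):
    """Evaluate `budget` capabilities, choosing each next one by max variance."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    evaluated = [int(i) for i in rng.choice(n, size=n_init, replace=False)]
    scores = {i: score_fn(i) for i in evaluated}
    while len(evaluated) < budget:
        y_train = np.array([scores[i] for i in evaluated])
        _, var = gp_posterior(embeddings[evaluated], y_train, embeddings)
        var[evaluated] = -np.inf  # never re-select an evaluated capability
        nxt = int(np.argmax(var))
        scores[nxt] = score_fn(nxt)
        evaluated.append(nxt)
    y_train = np.array([scores[i] for i in evaluated])
    mean, _ = gp_posterior(embeddings[evaluated], y_train, embeddings)
    return mean, evaluated


# Synthetic stand-in data: 60 "capabilities" embedded in 2-D (an assumption).
data_rng = np.random.default_rng(1)
embeddings = data_rng.uniform(-2.0, 2.0, size=(60, 2))
true_scores = 0.5 + 0.4 * np.sin(embeddings[:, 0]) * np.cos(embeddings[:, 1])

predicted, evaluated = active_evaluate(
    embeddings, lambda i: float(true_scores[i]), budget=25
)
rmse = float(np.sqrt(np.mean((predicted - true_scores) ** 2)))
print(f"evaluated {len(evaluated)} of {len(embeddings)} capabilities, RMSE={rmse:.4f}")
```

In this sketch the score function is called only for the selected capabilities, and the remaining scores are read off the posterior mean. Variance-based selection is one common acquisition rule; the abstract is truncated before it states which strategy ACE uses.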

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- Well-scoped problem framing and clear pipeline. The paper cleanly reframes evaluation as approximating a latent capability function, then operationalizes it with a practical, modular pipeline (Fig. 1). This alignment of concept and system is strong.
- Large, balanced capability set with automated tasking. In math, ACE builds 433 capabilities and 11.8k tasks; distributional comparisons show balanced area coverage relative to GSM8K and MATH, which are skewed (Fig. 2a).
- Fine-grained comparat

Weaknesses

- Wikipedia <-> ACE mapping relies on an LLM classifier; no human adjudication or error bars are reported.
- The AL ablation fits f on o3-mini only; it’s unclear if the sample-efficiency curve transfers across models.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The problem of LLM capability evaluation using model-generated tasks is an important and timely topic.
- The formulation of a capability hierarchy, along with the analysis of coverage relative to Wikipedia and relevant datasets, is novel and interesting.

Weaknesses

1. A major concern is the lack of clear referencing to prior work. This paper seems to build heavily on Automated Capability Discovery (Lu et al., 2025), sharing many motivations and methods with ACD. While this is fine, it should be explicitly stated in key sections like the introduction, discussing why ACD is insufficient for the research motivation, which algorithmic components are adopted, and what novel contributions the authors introduce.
2. Several components lack clarity:
   - **Capabili

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper proposes a pipeline to automatically evaluate a model's capabilities in specific domains, avoiding the manual curation of static benchmarks. Moving beyond static benchmarks to an adaptive evaluation process is a worthwhile direction.

Weaknesses

Limited Scope: While the paper mentions evaluating close-ended, deterministic, and open-ended tasks (Section 2.2), the experiments are confined to mathematics, a domain with highly verifiable outcomes. The framework's applicability to other domains remains unexplored.

Unjustified Comparisons: The critique of the GSM8K benchmark for lacking coverage of advanced topics like differential equations is misplaced, as its scope is intentionally limited to grade-school mathematics. Furthermore, the cla

Reviewer 04 · Rating 2 · Confidence 4

Strengths

Originality: While fine-grained hierarchical evaluations and synthetic task generation have been explored, the paper's original contribution is the formulation of LLM evaluation as an active learning problem on a latent capability function. The two-stage framework, which combines LLM-driven knowledge decomposition and task synthesis with Gaussian processes and active learning, is a powerful and original approach for scalable diagnostics.

Clarity: The paper is well-structured, and the methodolo

Weaknesses

1. The ceiling effect is a critical methodological flaw. The diagnostic map's resolution is limited to the current frontier model's knowledge, resulting in an unreliable capability profile for truly advanced, research-level concepts. For domains as serious as mathematics, the lack of human audit and curation at the high end means that the generated tasks cannot capture the complexity and novelty found in expert-created datasets like FrontierMath, HARDMath or DeepMath-103K. The framework, therefo
