TL;DR
This paper presents ACE, an automated framework that uses active learning and frontier models to evaluate foundation models' capabilities more comprehensively and efficiently than traditional static benchmarks.
Contribution
ACE introduces a scalable, automated, and fine-grained evaluation method that leverages semantic decomposition and active learning to assess foundation models' capabilities with minimal human effort.
Findings
ACE generated 433 capabilities and 11,800 tasks in Mathematics.
ACE came within 0.01 RMSE of exhaustive evaluation while testing fewer than half of the capabilities.
ACE uncovers fine-grained differences and provides a more complete capability profile.
Abstract
Current evaluation frameworks for foundation models rely heavily on static, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful frontier models to decompose a domain into semantically meaningful capabilities and generates diverse evaluation tasks, significantly reducing human effort. In Mathematics, ACE generated 433 capabilities and 11,800 tasks, covering 94% of Wikipedia-defined skills in the domain while introducing novel, coherent ones. To maximize efficiency, ACE fits a capability model in latent semantic space, allowing reliable approximation of a subject model's performance by evaluating only a subset of capabilities via…
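The capability-model idea in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian process with an RBF kernel over capability embeddings and a maximum-variance acquisition rule; one reviewer mentions Gaussian processes and active learning, but the kernel, the acquisition rule, the synthetic data, and all names here (`rbf_kernel`, `gp_posterior`, `active_evaluate`, `score_fn`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between the rows of a and the rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-3, length_scale=1.0):
    # GP regression with a constant mean set to the average observed score.
    mean_y = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train, length_scale)
    alpha = np.linalg.solve(k_tt, y_train - mean_y)
    mean = mean_y + k_qt @ alpha
    v = np.linalg.solve(k_tt, k_qt.T)
    var = 1.0 - np.sum(k_qt * v.T, axis=1)
    return mean, np.maximum(var, 0.0)


def active_evaluate(embeddings, score_fn, budget, n_init=5, seed=0):
    # Hypothetical loop: score a few random capabilities, then repeatedly
    # score the capability whose predicted score is most uncertain.
    rng = np.random.default_rng(seed)
    evaluated = list(rng.choice(len(embeddings), size=n_init, replace=False))
    scores = {i: score_fn(i) for i in evaluated}
    while len(evaluated) < budget:
        y = np.array([scores[i] for i in evaluated])
        _, var = gp_posterior(embeddings[evaluated], y, embeddings)
        var[evaluated] = -np.inf  # never re-select an evaluated capability
        nxt = int(np.argmax(var))
        scores[nxt] = score_fn(nxt)
        evaluated.append(nxt)
    y = np.array([scores[i] for i in evaluated])
    mean, _ = gp_posterior(embeddings[evaluated], y, embeddings)
    return mean, evaluated


# Synthetic stand-in for 100 capabilities embedded in a 2-D latent space.
rng = np.random.default_rng(1)
emb = rng.uniform(-2.0, 2.0, size=(100, 2))
true = 0.5 + 0.4 * np.sin(emb[:, 0]) * np.cos(emb[:, 1])

# Evaluate only 40 of the 100 capabilities and predict the rest.
pred, used = active_evaluate(emb, lambda i: float(true[i]), budget=40)
rmse = float(np.sqrt(np.mean((pred - true) ** 2)))
print(f"evaluated {len(used)} of {len(emb)} capabilities, RMSE = {rmse:.4f}")
```

In this toy setting the scores are a smooth function of the embedding, which is what makes a small evaluated subset informative about the rest. Whether real capability scores behave this way is an empirical question that the sketch does not address.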
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
- Well-scoped problem framing and clear pipeline. The paper cleanly reframes evaluation as approximating a latent capability function, then operationalizes it with a practical, modular pipeline (Fig. 1). This alignment of concept and system is strong.
- Large, balanced capability set with automated tasking. In math, ACE builds 433 capabilities and 11.8k tasks; distributional comparisons show balanced area coverage relative to GSM8K and MATH, which are skewed. (Fig. 2a.)
- Fine-grained comparat
- Wikipedia <-> ACE mapping relies on an LLM classifier; no human adjudication or error bars are reported.
- The AL ablation fits f on o3-mini only; it’s unclear if the sample-efficiency curve transfers across models.
- The problem of LLM capability evaluation using model-generated tasks is an important and timely topic.
- The formulation of a capability hierarchy, along with the analysis of coverage relative to Wikipedia and relevant datasets, is novel and interesting.
1. A major concern is the lack of clear referencing to prior work. This paper seems to build heavily on Automated Capability Discovery (Lu et al., 2025), sharing many motivations and methods with ACD. While this is fine, it should be explicitly stated in key sections like the introduction, discussing why ACD is insufficient for the research motivation, which algorithmic components are adopted, and what novel contributions the authors introduce.
2. Several components lack clarity: - **Capabili
The paper proposes a pipeline that automatically evaluates a model's capabilities in specific domains, avoiding the manual curation of static benchmarks. Moving beyond static benchmarks to an adaptive evaluation process is a considerable contribution.
Limited Scope: while the paper mentions evaluating close-ended, deterministic, and open-ended tasks (Section 2.2), the experiments are confined to mathematics, a domain with highly verifiable outcomes. The framework's applicability to other domains remains unexplored.
Unjustified Comparisons: the critique of the GSM8K benchmark for lacking coverage of advanced topics like differential equations is misplaced, as its scope is intentionally limited to grade-school mathematics. Furthermore, the cla
Originality: While fine-grained hierarchical evaluations and synthetic task generation have been explored, the paper's original contribution is the formulation of LLM evaluation as an active learning problem on a latent capability function. The two-stage framework, which combines LLM-driven knowledge decomposition and task synthesis with Gaussian processes and active learning, is a powerful and original approach for scalable diagnostics.
Clarity: The paper is well-structured, and the methodolo
1. The ceiling effect is a critical methodological flaw. The diagnostic map's resolution is limited to the current frontier model's knowledge, resulting in an unreliable capability profile for truly advanced, research-level concepts. For domains as serious as mathematics, the lack of human audit and curation at the high end means that the generated tasks cannot capture the complexity and novelty found in expert-created datasets like FrontierMath, HARDMath or DeepMath-103K. The framework, therefo
