Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Mohammed Saidul Islam; Negin Baghbanzadeh; Farnaz Kohankhaki; Afshin Cheraghi; Ali Kore; Shayaan Mehdi; Elham Dolatabadi; Arash Afkanpour

arXiv:2605.18824 · cs.LG · May 20, 2026

TL;DR

This paper presents an automated framework for generating comprehensive, fine-grained benchmarks with rich metadata, improving the evaluation of foundation models across multiple domains.

Contribution

The authors introduce a novel multi-agent framework for automated benchmark creation that enhances coverage, metadata richness, and solution reliability for evaluating foundation models.

Findings

01

Expert review finds a lower ground-truth error rate in the generated benchmarks than in earlier benchmarks such as MMLU and GSM8K.

02

Evaluating 12 commercial and open-source models reveals performance differences that prior benchmarks missed.

03

The framework successfully creates benchmarks in Machine Learning, Corporate Finance, and Personal Finance.

Abstract

Evaluation of foundation models often relies on aggregate scores from benchmarks that lack the comprehensive coverage and metadata needed for fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground-truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve…
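
To illustrate the contrast the abstract draws between aggregate scores and fine-grained evaluation, here is a minimal sketch of how per-problem metadata could be used to break one accuracy number into per-category numbers. It is not the authors' pipeline: the record layout, the metadata fields "topic" and "difficulty", the helper accuracy_by_tag, and the toy results are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_tag(results, tag):
    """Group per-problem correctness by one metadata field (hypothetical schema)
    and return the accuracy within each group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in results:
        key = record["metadata"][tag]
        totals[key] += 1
        correct[key] += int(record["correct"])
    return {key: correct[key] / totals[key] for key in totals}


# Toy, made-up results for one model; not data from the paper.
results = [
    {"metadata": {"topic": "optimization", "difficulty": "hard"}, "correct": True},
    {"metadata": {"topic": "optimization", "difficulty": "easy"}, "correct": True},
    {"metadata": {"topic": "probability", "difficulty": "hard"}, "correct": False},
    {"metadata": {"topic": "probability", "difficulty": "easy"}, "correct": True},
]

# The aggregate score hides where the single failure occurred.
overall = sum(r["correct"] for r in results) / len(results)

# The same results, sliced by two different metadata fields.
by_topic = accuracy_by_tag(results, "topic")
by_difficulty = accuracy_by_tag(results, "difficulty")

print(overall, by_topic, by_difficulty)
```

In this toy case the aggregate accuracy alone would not show that the only error falls on a hard probability problem, which is the kind of difference a metadata-rich benchmark is meant to surface.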
