Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models
Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki, Afshin Cheraghi, Ali Kore, Shayaan Mehdi, Elham Dolatabadi, Arash Afkanpour

TL;DR
This paper presents an automated framework for generating comprehensive, fine-grained benchmarks with rich metadata, improving the evaluation of foundation models across multiple domains.
Contribution
The authors introduce a novel multi-agent framework for automated benchmark creation that enhances coverage, metadata richness, and solution reliability for evaluating foundation models.
Findings
Expert review finds a lower ground-truth error rate in the generated benchmarks than in previous benchmarks such as MMLU and GSM8K.
Evaluation of 12 commercial and open-source models reveals performance differences that prior benchmarks missed.
The framework successfully generates benchmarks in Machine Learning, Corporate Finance, and Personal Finance.
Abstract
Evaluation of foundation models often relies on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve…
