Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation
Aris Hofmann, Inge Vejsbjerg, Dhaval Salwala, Elizabeth M. Daly

TL;DR
Auto-BenchmarkCard automates the creation of benchmark documentation by combining multi-source data extraction with LLM-driven synthesis and a factual validation step, with the aim of making AI benchmarks more transparent and easier to compare.
Contribution
The paper introduces a workflow that combines multi-agent data extraction, LLM-driven synthesis, and factual validation (atomic entailment scoring with FactReasoner) to improve the quality of benchmark documentation.
Findings
Effective extraction from heterogeneous sources
Improved factual accuracy through validation
Enhanced benchmark transparency and comparability
Abstract
We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.
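The workflow described in the abstract (extract from several sources, synthesize a card, then score each atomic claim against evidence) can be illustrated with a toy sketch. This is not the authors' implementation and does not call FactReasoner, Hugging Face, or Unitxt: every name below (`merge_sources`, `split_into_claims`, `toy_entailment`, `validate_card`, the `threshold` value) is hypothetical, the LLM synthesis step is replaced by a plain dictionary merge, and the entailment score is a token-overlap placeholder standing in for a real entailment model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ClaimScore:
    claim: str        # one atomic statement taken from the card
    score: float      # placeholder support score in [0, 1]
    supported: bool   # True when score >= threshold


def merge_sources(sources: dict[str, dict[str, str]]) -> dict[str, str]:
    """Stand-in for extraction + synthesis: merge per-source fields.

    Sources listed first win when two sources disagree on a field.
    """
    merged: dict[str, str] = {}
    for fields in sources.values():
        for key, value in fields.items():
            merged.setdefault(key, value)
    return merged


def split_into_claims(card: dict[str, str]) -> list[str]:
    """Turn each card field into one atomic '<field> is <value>' claim."""
    return [f"{field} is {value}" for field, value in card.items()]


def toy_entailment(claim: str, evidence: str) -> float:
    """Placeholder score: fraction of claim tokens that occur in the evidence.

    A real validator would use an entailment model here (assumption).
    """
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)


def validate_card(
    card: dict[str, str], evidence: str, threshold: float = 0.6
) -> list[ClaimScore]:
    """Score every atomic claim in the card against the evidence text."""
    results = []
    for claim in split_into_claims(card):
        score = toy_entailment(claim, evidence)
        results.append(ClaimScore(claim, score, score >= threshold))
    return results
```

Calling `validate_card(merge_sources(sources), evidence)` returns one `ClaimScore` per card field, so unsupported statements can be flagged individually rather than accepting or rejecting the whole card. Scoring at the level of single claims is the design point the abstract attributes to atomic entailment scoring; the scoring function itself is the part a real system would replace.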
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Computational and Text Analysis Methods
