Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation

Aris Hofmann; Inge Vejsbjerg; Dhaval Salwala; Elizabeth M. Daly

arXiv:2512.09577 · cs.HC · December 11, 2025

TL;DR

Auto-BenchmarkCard automates the creation of benchmark documentation by combining multi-source data extraction with language model synthesis and factual validation, with the aim of improving transparency and comparability in AI benchmarking.

Contribution

The paper introduces a workflow that combines multi-agent data extraction, LLM synthesis, and factual validation to improve the quality of benchmark documentation.

Findings

01. Effective extraction from heterogeneous sources

02. Improved factual accuracy through validation

03. Enhanced benchmark transparency and comparability

Abstract

We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.
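
The validation phase described above can be illustrated with a minimal sketch, under stated assumptions. It splits a generated card into atomic claims, scores each claim against extracted evidence, and reports the fraction of supported claims. This is not the FactReasoner API: `split_into_atomic_claims`, `toy_entailment`, `validate_card`, `factuality`, and the 0.6 threshold are all hypothetical names and values chosen for illustration, and the token-overlap score is only a stand-in for a real entailment model.

```python
from dataclasses import dataclass


@dataclass
class ClaimScore:
    claim: str
    score: float      # best overlap score across all evidence passages
    supported: bool   # True when score reaches the threshold


def split_into_atomic_claims(card_text: str) -> list[str]:
    """Naive atomization (assumption): treat each sentence as one claim."""
    return [s.strip() for s in card_text.split(".") if s.strip()]


def toy_entailment(claim: str, evidence: str) -> float:
    """Stand-in for an entailment model: share of claim tokens found in the evidence."""
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)


def validate_card(card_text: str, evidence_passages: list[str],
                  threshold: float = 0.6) -> list[ClaimScore]:
    """Score every atomic claim against its best-matching evidence passage."""
    results = []
    for claim in split_into_atomic_claims(card_text):
        best = max((toy_entailment(claim, ev) for ev in evidence_passages),
                   default=0.0)
        results.append(ClaimScore(claim, best, best >= threshold))
    return results


def factuality(results: list[ClaimScore]) -> float:
    """Fraction of claims marked as supported (0.0 for an empty card)."""
    if not results:
        return 0.0
    return sum(r.supported for r in results) / len(results)


if __name__ == "__main__":
    card = "The benchmark has 500 questions. It covers legal reasoning"
    evidence = [
        "the benchmark has 500 questions in english",
        "tasks include math word problems",
    ]
    for r in validate_card(card, evidence):
        print(f"{r.supported!s:5}  {r.score:.2f}  {r.claim}")
```

Scoring at the level of individual claims, rather than the whole card, makes it possible to flag or drop a single unsupported statement while keeping the rest of the documentation.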

Code & Models

Datasets

ibm-research/Auto-BenchmarkCard
dataset · 33 downloads

Videos

Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation

Taxonomy

Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Computational and Text Analysis Methods