AI Benchmark Democratization and Carpentry

Gregor von Laszewski; Wesley Brewer; Jeyan Thiyagalingam; Juri Papay; Armstrong Foundjem; Piotr Luszczek; Murali Emani; Shirley V. Moore; Vijay Janapa Reddi; Matthew D. Sinclair; Sebastian Lobentanzer; Sujata Goswami; Benjamin Hawks; Marco Colombo; Nhan Tran; Christine R. Kirkpatrick; Abdulkareem Alsudais; Gregg Barrett; Tianhao Li; Kirsten Morehouse; Shivaram Venkataraman; Rutwik Jain; Kartik Mathur; Victor Lu; Tejinder Singh; Khojasteh Z. Mirza; Kongtao Chen; Sasidhar Kunapuli; Gavin Farrell; Renato Umeton; Geoffrey C. Fox

arXiv:2512.11588·cs.AI·December 15, 2025

Open Access

TL;DR

This paper discusses the need for dynamic, inclusive AI benchmarking frameworks that adapt to rapid AI evolution, emphasizing democratization, education, and community efforts to improve reproducibility and real-world relevance.

Contribution

It introduces the concept of AI Benchmark Carpentry, advocating for systematic education and technical innovation to democratize and adapt benchmarking practices.

Findings

1. Current benchmarks are often static and hardware-focused.
2. Dynamic benchmarking can better reflect real-world AI deployment.
3. Community efforts are essential for democratizing benchmarking.

Abstract

Benchmarks are a cornerstone of modern machine learning, enabling reproducibility, comparison, and scientific progress. However, AI benchmarks are increasingly complex, requiring dynamic, AI-focused workflows. Rapid evolution in model architectures, scale, datasets, and deployment contexts makes evaluation a moving target. Large language models often memorize static benchmarks, causing a gap between benchmark results and real-world performance. Beyond traditional static benchmarks, continuous adaptive benchmarking frameworks are needed to align scientific assessment with deployment risks. This calls for skills and education in AI Benchmark Carpentry. From our experience with MLCommons, educational initiatives, and programs like the DOE's Trillion Parameter Consortium, key barriers include high resource demands, limited access to specialized hardware, lack of benchmark design…

Taxonomy

Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Explainable Artificial Intelligence (XAI)