Preliminary suggestions for rigorous GPAI model evaluations
Patricia Paskov, Michael J. Byun, Kevin Wei, Toby Webster

TL;DR
This paper offers preliminary guidelines for rigorously evaluating general-purpose AI (GPAI) models, aiming to improve validity, reproducibility, and cross-disciplinary standards across the evaluation lifecycle.
Contribution
It compiles and organizes evaluation suggestions from multiple fields to enhance the rigor and consistency of GPAI assessments, especially for evaluating models that present systemic risk under the EU AI Act.
Findings
Organizes suggestions across four stages of the evaluation lifecycle: design, implementation, execution, and documentation.
Draws on established practices from multiple disciplines to inform AI evaluation.
Aims to promote validity, reproducibility, and cross-disciplinary standards in GPAI evaluation.
Abstract
This document presents a preliminary compilation of general-purpose AI (GPAI) evaluation practices that may promote internal validity, external validity and reproducibility. It includes suggestions for human uplift studies and benchmark evaluations, as well as cross-cutting suggestions that may apply to many different evaluation types. Suggestions are organised across four stages in the evaluation life cycle: design, implementation, execution and documentation. Drawing from established practices in machine learning, statistics, psychology, economics, biology and other fields recognised to have important lessons for AI evaluation, these suggestions seek to contribute to the conversation on the nascent and evolving field of the science of GPAI evaluations. The intended audience of this document includes providers of GPAI models presenting systemic risk (GPAISR), for whom the EU AI Act…
