What do model reports say about their ChemBio benchmark evaluations? Comparing recent releases to the STREAM framework

Tom Reed; Tegan McCaslin; Luca Righetti

arXiv:2510.20927·cs.CY·October 29, 2025


TL;DR

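This paper compares three frontier AI model reports published in spring 2025 (OpenAI's o3, Anthropic's Claude 4, and Google DeepMind's Gemini 2.5 Pro) against the STREAM (v1) standard for reporting ChemBio benchmark evaluations, and identifies where their descriptions of evaluation methodology could be more transparent.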

Contribution

It compares three frontier AI model reports against the STREAM (v1) reporting standard, identifies gaps in how they document ChemBio benchmark evaluations, and recommends improvements in transparency.

Findings

1. Each report included some useful details that the others did not.

2. Reports generally lacked detailed examples of test materials.

3. Elicitation conditions need to be reported more comprehensively.

Abstract

Most frontier AI developers publicly document their safety evaluations of new AI models in model reports, including testing for chemical and biological (ChemBio) misuse risks. This practice provides a window into the methodology of these evaluations, helping to build public trust in AI systems and enabling third-party review in the still-emerging science of AI evaluation. But what aspects of evaluation methodology do developers currently include, or omit, in their reports? This paper examines three frontier AI model reports published in spring 2025 that are among the most detailed in their documentation: OpenAI's o3, Anthropic's Claude 4, and Google DeepMind's Gemini 2.5 Pro. We compare these using the STREAM (v1) standard for reporting ChemBio benchmark evaluations. Each model report included some useful details that the others did not, and all model reports were found to have areas for…
