What do model reports say about their ChemBio benchmark evaluations? Comparing recent releases to the STREAM framework
Tom Reed, Tegan McCaslin, Luca Righetti

TL;DR
This paper analyzes recent AI model reports on ChemBio safety evaluations, comparing them to the STREAM framework, and highlights areas for improved transparency and best practices in reporting methodologies.
Contribution
The paper compares three frontier AI model reports (OpenAI's o3, Anthropic's Claude 4, and Google DeepMind's Gemini 2.5 Pro) against the STREAM (v1) standard, identifying gaps and recommending improvements for transparency in ChemBio evaluation reporting.
Findings
Each report included some useful details that the others did not
Reports generally lacked detailed examples of test materials
There is a need for more comprehensive reporting of elicitation conditions
Abstract
Most frontier AI developers publicly document their safety evaluations of new AI models in model reports, including testing for chemical and biological (ChemBio) misuse risks. This practice provides a window into the methodology of these evaluations, helping to build public trust in AI systems, and enabling third-party review in the still-emerging science of AI evaluation. But what aspects of evaluation methodology do developers currently include -- or omit -- in their reports? This paper examines three frontier AI model reports published in spring 2025 that have among the most detailed documentation: OpenAI's o3, Anthropic's Claude 4, and Google DeepMind's Gemini 2.5 Pro. We compare these using the STREAM (v1) standard for reporting ChemBio benchmark evaluations. Each model report included some useful details that the others did not, and all model reports were found to have areas for…
