Model Capability Assessment and Safeguards for Biological Weaponization
Michael Richter

TL;DR
This paper benchmarks four AI models on 73 novice-framed, benign STEM prompts to assess their reasoning capabilities and potential for biological misuse, highlighting safety concerns and policy implications.
Contribution
It provides a comparative analysis of model capabilities and identifies risks of biological weaponization, emphasizing the need for improved safeguards and policy responses.
Findings
Gemini 3 Pro Thinking and Meta's Muse Spark Thinking scored very high on benign quantitative tasks.
Gemini showed an apparent lack of contextual awareness on edge-case prompts carrying subtle harmful intent.
Gemini's capability appears to outpace moderation calibration, raising safety concerns.
Abstract
AI leaders and safety reports increasingly warn that advances in model reasoning may enable biological misuse, including by low-expertise users, while major labs describe safeguards as expanding but still evolving rather than settled. This study benchmarks ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5, and Meta's Muse Spark Thinking on 73 novice-framed, open-ended benign STEM prompts to measure operational intelligence. On benign quantitative tasks, both Gemini and Meta scored very high; ChatGPT was partially useful but text-thinned, and Claude was sparsest, with some apparent false-positive refusals. A second test set probed detection of subtle harmful intent: edge-case prompts revealed Gemini's seeming lack of contextual awareness. These results warranted a focused weaponization analysis of Gemini, as capability appeared to be outpacing moderation calibration. Gemini was tested across…
