Quantifying CBRN Risk in Frontier Models
Divyanshu Kumar, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi

TL;DR
This paper evaluates the safety vulnerabilities of leading commercial large language models in handling CBRN-related prompts, revealing significant weaknesses in current safety measures and emphasizing the need for improved alignment and evaluation standards.
Contribution
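It provides what the authors describe as the first comprehensive assessment of CBRN-related risk in 10 leading commercial LLMs, using a novel 200-prompt CBRN dataset, a 180-prompt subset of the FORTRESS benchmark, and a three-tier attack methodology, and it exposes critical safety gaps.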
Findings
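Deep Inception attacks succeed 86.0% of the time, versus 33.8% for direct requests
Attack success rates range widely across models, from 2% (claude-opus-4) to 96% (mistral-small-latest)
Eight of the ten models exceed 70% vulnerability when asked to enhance dangerous material properties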
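The percentages above are attack success rates. As a rough illustration of how such a rate could be aggregated, the sketch below assumes each attempt has already been judged as a success or failure and simply reports the share of successes per group. The function name, the record layout, and the toy data are hypothetical and are not taken from the paper, which may define and score success differently.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {group: percentage of attempts judged successful}.

    `records` is an iterable of (group, succeeded) pairs, where `group`
    could be a model name or an attack tier (hypothetical layout).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for group, succeeded in records:
        totals[group] += 1
        if succeeded:
            successes[group] += 1
    return {group: 100.0 * successes[group] / totals[group] for group in totals}


# Made-up toy outcomes for illustration only, not results from the paper.
toy_records = [
    ("model-a", True), ("model-a", False), ("model-a", False), ("model-a", False),
    ("model-b", True), ("model-b", True), ("model-b", True), ("model-b", False),
]

rates = attack_success_rate(toy_records)
print(rates)
```

Grouping by attack tier instead of by model only requires changing the first element of each pair, which is why the sketch keeps the grouping key generic.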
Abstract
Frontier Large Language Models (LLMs) pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge. We present the first comprehensive evaluation of 10 leading commercial LLMs against both a novel 200-prompt CBRN dataset and a 180-prompt subset of the FORTRESS benchmark, using a rigorous three-tier attack methodology. Our findings expose critical safety vulnerabilities: Deep Inception attacks achieve 86.0% success versus 33.8% for direct requests, demonstrating superficial filtering mechanisms; model safety performance varies dramatically from 2% (claude-opus-4) to 96% (mistral-small-latest) attack success rates; and eight models exceed 70% vulnerability when asked to enhance dangerous material properties. We identify fundamental brittleness in current safety alignment, where simple prompt…
