RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts
Lukas Weidener, Marko Brkić, Mihailo Jovanović, Emre Ulgac, Aakaash Meduri

TL;DR
RefusalBench is a new benchmark for evaluating how frontier large language models refuse biological research prompts, revealing significant variability and calibration issues in their refusal behavior.
Contribution
This paper introduces RefusalBench, a matched-triple benchmark with 141 prompts across risk tiers, enabling robust comparison of model refusal behavior in biological research contexts.
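A minimal sketch of how a matched-triple design supports tier-conditioned comparison, assuming each bundle holds one prompt per risk tier (47 bundles x 3 tiers = 141 prompts). The `Response` record, the `refused` label, and the toy data are hypothetical illustrations, not the paper's schema or results.

```python
from collections import defaultdict
from dataclasses import dataclass

TIERS = ("benign", "borderline", "dual-use")
N_BUNDLES = 47  # from the summary: 141 prompts in 47 bundles


@dataclass(frozen=True)
class Response:
    bundle_id: int  # which matched triple the prompt belongs to
    tier: str       # one of TIERS
    refused: bool   # hypothetical strict-refusal label for one model


def tier_refusal_rates(responses):
    """Refusal rate per tier, computed over complete bundles only.

    Restricting to bundles with all three tiers keeps the comparison
    matched: every tier is scored on the same set of task framings.
    """
    by_bundle = defaultdict(dict)
    for r in responses:
        by_bundle[r.bundle_id][r.tier] = r.refused
    complete = [b for b in by_bundle.values() if set(b) == set(TIERS)]
    if not complete:
        return {}
    return {t: sum(b[t] for b in complete) / len(complete) for t in TIERS}


# Toy data (invented): two complete bundles, one incomplete bundle that is dropped.
toy = [
    Response(0, "benign", False), Response(0, "borderline", False), Response(0, "dual-use", True),
    Response(1, "benign", False), Response(1, "borderline", True), Response(1, "dual-use", True),
    Response(2, "benign", True),
]
rates = tier_refusal_rates(toy)
print(rates)
```

Because every bundle contributes exactly one prompt to each tier, differences between the per-tier rates cannot be driven by one tier drawing on easier or harder subdomains than another.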
Findings
Strict refusal rates range from 0.1% to 94.6% across 19 frontier models on identical prompts.
Provider identity predicts refusal behavior in this snapshot; jurisdiction does not (Mann-Whitney U, p = 0.393).
Refusal calibration does not reliably indicate safety or dual-use detection.
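The jurisdiction comparison is reported as a Mann-Whitney U test. As a rough illustration of what that statistic measures, the sketch below computes U directly from pairwise comparisons. The per-model refusal rates and the two groups are invented for illustration; the paper's actual grouping and test configuration are not given in this summary.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys.

    Counts the (x, y) pairs in which x exceeds y; a tie counts as half.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical per-model refusal rates, NOT values from the paper.
group_a = [0.02, 0.10, 0.85, 0.90]  # e.g. models from one jurisdiction
group_b = [0.05, 0.40, 0.60]        # e.g. models from another

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)
```

The two U values always sum to the number of cross-group pairs, so a U near half that total means neither group tends to outrank the other. A bimodal group, like `group_a` here, can produce exactly that outcome even when its members differ sharply from one another.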
Abstract
Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use), enabling tier-conditioned comparisons robust to subdomain confounding. A 15-prompt should-refuse positive-control module establishes per-model calibration floors; three models fail to refuse even these prompts. Across 19 frontier models in the May 2026 snapshot, strict refusal rates span 0.1% to 94.6% on identical prompts. Jurisdiction does not predict refusal in this snapshot (Mann-Whitney U, p = 0.393; EU n = 1, US bimodal); provider identity does, with Anthropic's API…
