IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras

TL;DR
This study introduces IatroBench, a framework for measuring when safety-trained large language models withhold critical medical information. It finds that withholding depends on the asker's stated identity and that the size of this gap varies across models.
Contribution
It provides a structured evaluation pipeline for assessing safety-related withholding in frontier models across clinical scenarios, validated against physician scoring.
Findings
Models show identity-contingent withholding of medical advice.
Safety investment correlates with wider withholding gaps.
Standard LLM judges underestimate omission harm in responses.
Abstract
Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre-registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0-3; omission harm, OH 0-4) through a structured-evaluation pipeline validated against physician scoring (kappa_w = 0.571, within-1 agreement 96%). The central finding is identity-contingent withholding: match the same clinical question in…
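The two validation statistics quoted in the abstract (weighted kappa and within-1 agreement) can be illustrated with a minimal sketch. The abstract does not say whether the weighting is linear or quadratic, so the `weights` argument below is an assumption, as are the function names and the toy score lists, which are hypothetical and not data from the study. The five categories correspond to the OH 0-4 scale.

```python
from collections import Counter


def weighted_kappa(a, b, n_categories, weights="linear"):
    """Cohen's weighted kappa for two raters' ordinal scores in 0..n_categories-1.

    The weighting scheme is an assumption; the source does not specify it.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(a)
    power = 1 if weights == "linear" else 2
    max_dist = (n_categories - 1) ** power

    def disagreement(i, j):
        # 0 for identical scores, 1 for scores at opposite ends of the scale
        return abs(i - j) ** power / max_dist

    # Mean disagreement actually observed between the two raters
    observed = sum(disagreement(x, y) for x, y in zip(a, b)) / n

    # Mean disagreement expected if the raters scored independently,
    # given each rater's own marginal score distribution
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(
        disagreement(i, j) * count_a[i] * count_b[j]
        for i in range(n_categories)
        for j in range(n_categories)
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1 - observed / expected


def within_one_agreement(a, b):
    """Fraction of items on which the two raters differ by at most one point."""
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)


# Hypothetical omission-harm scores (OH, 0-4) for eight responses
pipeline_scores = [0, 1, 2, 3, 4, 2, 1, 3]
physician_scores = [0, 1, 3, 3, 4, 1, 1, 4]

kappa = weighted_kappa(pipeline_scores, physician_scores, n_categories=5)
within_one = within_one_agreement(pipeline_scores, physician_scores)
print(f"weighted kappa = {kappa:.3f}, within-1 agreement = {within_one:.0%}")
```

Reporting both numbers together is useful because a moderate kappa can coexist with high within-1 agreement when raters rarely disagree by more than one point on an ordinal scale.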
