LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

Yujun Zhou; Jingdong Yang; Yue Huang; Kehan Guo; Zoe Emory; Bikram Ghosh; Amita Bedar; Sujay Shekar; Zhenwen Liang; Pin-Yu Chen; Tian Gao; Werner Geyer; Nuno Moniz; Nitesh V Chawla; Xiangliang Zhang

arXiv:2410.14182 · cs.CL · February 13, 2026

TL;DR

This paper introduces LabSafety Bench, a comprehensive benchmark for evaluating AI models on safety tasks in scientific labs, revealing current models' limitations in hazard detection and risk assessment.

Contribution

It presents a new safety benchmark for AI in labs, highlighting the need for specialized evaluation before real-world deployment.

Findings

1. No model exceeds 70% accuracy in hazard identification.

2. Proprietary models perform well on structured tests but not on open-ended reasoning.

3. Current models are insufficient for safe laboratory use.

Abstract

Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, yet their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on…
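The 70% figure above is an accuracy over multiple-choice questions. As a minimal sketch of how such a score could be computed, assuming each question has a single correct option letter and each model response has already been reduced to one letter: the function name `multiple_choice_accuracy` and the toy data are hypothetical illustrations, not the paper's evaluation code or results.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted option matches the answer key.

    Hypothetical helper for illustration; comparison ignores case and
    surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Made-up toy data (not taken from the benchmark): four questions, one miss.
answers = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]

accuracy = multiple_choice_accuracy(predictions, answers)
print(f"accuracy = {accuracy:.0%}")
# Compare against the 70% level cited in the abstract.
print("above 70%" if accuracy > 0.70 else "at or below 70%")
```

Exact-match scoring of this kind applies only to the multiple-choice portion; the open-ended scenario tasks described in the abstract would need a different grading scheme.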

Code & Models

Datasets

yujunzhou/LabSafety_Bench
dataset · 78 dl
