LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories

Qianpu Sun; Xiaowei Chi; Yuhan Rui; Ying Li; Kuangzhi Ge; Jiajun Li; Sirui Han; and Shanghang Zhang

arXiv:2603.11987 · cs.AI · March 13, 2026

TL;DR

LABSHIELD introduces a comprehensive benchmark for evaluating multimodal large language models in safety-critical laboratory scenarios, emphasizing hazard identification and safety reasoning aligned with OSHA standards.

Contribution

This work presents the first detailed safety benchmark for embodied AI in laboratories, including a taxonomy, diverse tasks, and evaluation of multiple models to assess safety reasoning capabilities.

Findings

01. Models show a 32% performance drop in safety tasks compared to general accuracy.

02. Significant gaps exist in hazard interpretation and safety-aware planning.

03. Current models lack sufficient safety-centric reasoning in high-stakes environments.

Abstract

Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Topic Modeling