Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure

Trishna Chakraborty; Udita Ghosh; Aldair Ernesto Gongora; Ruben Glatt; Yue Dong; Jiachen Li; Amit K. Roy-Chowdhury; Chengyu Song

arXiv:2602.00414·cs.CV·February 3, 2026

TL;DR

This paper explores using vision language models for autonomous safety monitoring in laboratories, introducing a structured data pipeline and a scene-graph-guided alignment method to improve hazard detection from visual data.

Contribution

The paper presents a data generation pipeline that produces aligned (image, scene graph, ground truth) triples from textual laboratory scenarios, and proposes a scene-graph-guided alignment technique to improve VLMs' hazard detection from images.

Findings

01. VLMs perform well with textual scene graphs.

02. Performance drops significantly with visual-only inputs.

03. Scene-graph-guided alignment improves hazard detection from images.

Abstract

Laboratories are prone to severe injuries from minor unsafe actions, yet continuous safety monitoring -- beyond mandatory pre-lab safety training -- is limited by human availability. Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring, but their effectiveness in realistic settings is unclear due to the lack of visual evaluation data, as most safety incidents are documented primarily as unstructured text. To address this gap, we first introduce a structured data generation pipeline that converts textual laboratory scenarios into aligned triples of (image, scene graph, ground truth), using large language models as scene graph architects and image generation models as renderers. Our experiments on the synthetic dataset of 1,207 samples across 362 unique scenarios and seven open- and closed-source models show that VLMs perform effectively given textual…
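
To make the (image, scene graph, ground truth) triple concrete, here is a minimal Python sketch of how one sample might be represented and assembled. It is an illustration only, not the authors' implementation: the class names, the `architect` and `renderer` callables (stand-ins for the large language model and the image generation model), the relation format, and the toy scenario and label are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical relation format: (subject, predicate, object).
Relation = Tuple[str, str, str]


@dataclass
class SceneGraph:
    """Objects in a lab scene plus the relations between them (assumed schema)."""
    objects: List[str]
    relations: List[Relation] = field(default_factory=list)

    def to_text(self) -> str:
        # Serialise the graph as plain text, one relation per line.
        return "\n".join(f"{s} --{p}--> {o}" for s, p, o in self.relations)


@dataclass
class Sample:
    """One aligned triple: rendered image, scene graph, ground truth."""
    image_path: str
    scene_graph: SceneGraph
    ground_truth: str


def build_sample(
    scenario: str,
    ground_truth: str,
    architect: Callable[[str], SceneGraph],
    renderer: Callable[[SceneGraph], str],
) -> Sample:
    # Step 1: the "architect" turns the textual scenario into a scene graph.
    graph = architect(scenario)
    # Step 2: the "renderer" turns the scene graph into an image on disk.
    image_path = renderer(graph)
    return Sample(image_path=image_path, scene_graph=graph, ground_truth=ground_truth)


# Toy stand-ins so the sketch runs without any model; both are hypothetical.
def toy_architect(scenario: str) -> SceneGraph:
    return SceneGraph(
        objects=["researcher", "beaker", "goggles", "bench"],
        relations=[("researcher", "holds", "beaker"), ("goggles", "on", "bench")],
    )


def toy_renderer(graph: SceneGraph) -> str:
    # A real renderer would call an image generation model; this returns a fake path.
    return f"render_{len(graph.objects)}_objects.png"


sample = build_sample(
    scenario="A researcher pours liquid while their goggles sit on the bench.",
    ground_truth="hazard: eye protection not worn",
    architect=toy_architect,
    renderer=toy_renderer,
)
print(sample.scene_graph.to_text())
```

Keeping the scene graph as a separate field means the same sample can be presented to a model either as text (via `to_text`) or as the rendered image, which is the comparison the findings describe.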

Taxonomy

Topics: Multimodal Machine Learning Applications · Machine Learning in Materials Science · Data Visualization and Analytics