Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces

Zhiling Chen; Hanning Chen; Mohsen Imani; Ruimin Chen; Farhad Imani

arXiv:2408.07146·cs.CV·August 15, 2024

Open Access

TL;DR

This paper presents Clip2Safety, a vision-language model framework that detects PPE compliance and verifies fine-grained PPE attributes across diverse workplaces with higher accuracy and faster inference, enhancing safety monitoring.

Contribution

Introduction of Clip2Safety, a novel interpretable detection framework combining scene recognition, visual prompts, and fine-grained verification for PPE compliance across various scenarios.
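The three stages named above can be pictured with a toy sketch. It is not the authors' implementation: the scene-to-PPE table, the `Detection` type, and the functions `recognize_scene`, `build_prompts`, and `verify` are all hypothetical stand-ins, and simple string matching replaces the vision-language model. It only illustrates how a per-item report keeps the compliance decision interpretable.

```python
from dataclasses import dataclass

# Hypothetical scene-to-PPE requirements; the real mapping is not given in this summary.
REQUIRED_PPE = {
    "construction": {"helmet": "intact", "vest": "reflective"},
    "laboratory": {"goggles": "worn", "gloves": "worn"},
}


@dataclass
class Detection:
    item: str       # PPE category reported for an image region
    attribute: str  # fine-grained attribute reported for that item


def recognize_scene(caption: str) -> str:
    """Stage 1 (stub): pick the known scene whose name appears in a text description."""
    for scene in REQUIRED_PPE:
        if scene in caption.lower():
            return scene
    raise ValueError(f"no known scene in: {caption!r}")


def build_prompts(scene: str) -> list[str]:
    """Stage 2 (stub): turn the scene's requirements into one prompt per item."""
    return [f"{item} ({attr})" for item, attr in sorted(REQUIRED_PPE[scene].items())]


def verify(scene: str, detections: list[Detection]) -> dict[str, bool]:
    """Stage 3 (stub): each required item must be present with the required attribute."""
    found = {d.item: d.attribute for d in detections}
    return {item: found.get(item) == attr for item, attr in REQUIRED_PPE[scene].items()}


scene = recognize_scene("Workers on a construction site")
report = verify(scene, [Detection("helmet", "intact"), Detection("vest", "plain")])
print(scene, build_prompts(scene), report)
```

In this sketch a missing item and an item with the wrong attribute both show up as `False` for that item, so a reviewer can see which requirement failed rather than a single pass/fail flag.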

Findings

01. Outperforms state-of-the-art VLMs in accuracy.

02. Achieves inference 200 times faster than those VLMs.

03. Validated across six real-world workplace scenarios.

Abstract

Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to traditional object detection limitations in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, requiring them to interpret context-specific language and visual cues…

Taxonomy

Topics: Occupational Health and Safety Research · Safety Warnings and Signage · Risk and Safety Analysis

Methods: RoIPool · 1x1 Convolution · Region Proposal Network · Softmax · Faster R-CNN · Convolution · Non Maximum Suppression · SSD