Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

Muhammad Adil; Mehmood Ahmed; Muhammad Aqib; Vicente A. Gonzalez; Gaang Lee; Qipei Mei

arXiv:2604.05210 · cs.CV · April 8, 2026

TL;DR

This study presents a framework that combines object detection with small vision-language models (sVLMs) to improve construction hazard identification. It raises the F1-score of Gemma-3 4B from 34.5% to 50.6% while adding only 2.5 ms of inference overhead per image.

Contribution

The paper introduces a detection-guided sVLM framework that improves hazard identification accuracy by combining object localization with multimodal reasoning over construction scenes.

Findings

01

Gemma-3 4B achieved 50.6% F1-score, up from 34.5%.

02

Hazard explanation quality improved, with BERTScore F1 rising from 0.61 to 0.82.

03

Inference overhead was only 2.5 ms per image.
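
For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below shows that calculation only; the function name and the counts in the usage lines are hypothetical and are not taken from the paper's evaluation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 from true-positive, false-positive, and false-negative counts."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # No positives predicted or present: define F1 as 0.0 by convention.
        return 0.0
    # Algebraically equal to 2 * precision * recall / (precision + recall).
    return 2 * tp / denominator


# Hypothetical counts, for illustration only.
example = f1_score(tp=8, fp=2, fn=2)
print(f"F1 = {example:.2f}")
```

A single number of this kind rewards a model only when it both finds the hazards (recall) and avoids false alarms (precision), which is why it is a common headline metric for hazard identification.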

Abstract

Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured…
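
The abstract is truncated here, so the exact way detections are passed to the sVLM is not shown on this page. As a rough illustration of the general idea of detection-guided prompting, the sketch below turns a list of detections into a text prompt. Everything in it is an assumption made for illustration: the `Detection` record, the `build_prompt` helper, the confidence threshold, and the prompt wording are hypothetical and are not the authors' format. The detector and the sVLM calls are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: class label, confidence, pixel box."""
    label: str
    confidence: float
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def build_prompt(detections: List[Detection], min_confidence: float = 0.5) -> str:
    """Format detections as a text prompt for a vision-language model.

    The layout and wording are assumptions, not the paper's prompt.
    """
    # Drop low-confidence boxes so they do not mislead the language model.
    kept = [d for d in detections if d.confidence >= min_confidence]

    lines = ["Detected entities (label, confidence, box as x1, y1, x2, y2):"]
    if not kept:
        lines.append("- none")
    for index, det in enumerate(kept, start=1):
        x1, y1, x2, y2 = det.box
        lines.append(
            f"- {index}. {det.label} ({det.confidence:.2f}) "
            f"at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]"
        )
    lines.append(
        "Using the image and the entities above, state whether any worker "
        "is exposed to a hazard and explain why."
    )
    return "\n".join(lines)


# Hypothetical detections, for illustration only.
sample = [
    Detection("worker", 0.91, (120.0, 80.0, 210.0, 320.0)),
    Detection("excavator", 0.87, (200.0, 60.0, 640.0, 400.0)),
    Detection("worker", 0.30, (10.0, 10.0, 40.0, 90.0)),  # filtered out
]
prompt = build_prompt(sample)
print(prompt)
```

Grounding the prompt in explicit labels and boxes gives the model concrete entities to reason about, which is one plausible way to reduce the hallucination the abstract attributes to small models.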
