Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
Muhammad Adil, Mehmood Ahmed, Muhammad Aqib, Vicente A. Gonzalez, Gaang Lee, Qipei Mei

TL;DR
This study presents a framework combining object detection with small vision-language models to improve construction hazard identification, achieving higher accuracy with minimal computational overhead.
Contribution
It introduces a detection-guided sVLM framework that enhances hazard detection accuracy by integrating object localization with multimodal reasoning in construction scenes.
Findings
With detection guidance, Gemma-3 4B reached an F1-score of 50.6%, up from 34.5%.
Hazard explanation quality improved, with BERTScore F1 rising from 0.61 to 0.82.
Inference overhead was only 2.5 ms per image.
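For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts. The counts in the usage line are illustrative only and are not taken from the paper, and this summary does not say how the paper aggregates F1 across hazard classes.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score given true positives, false positives, false negatives."""
    if tp == 0:
        # No correct predictions: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not from the paper).
print(f"F1 = {f1_score(tp=50, fp=50, fn=50):.3f}")
```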
Abstract
Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured…
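The detection-guided step described above can be sketched roughly as follows. The abstract is cut off before it states what structured form the detections take, so this sketch assumes a plain-text prompt. The `Detection` class, the `build_prompt` function, the confidence threshold, and the prompt wording are all hypothetical and are not taken from the paper. The detections are hard-coded stand-ins for detector output.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected entity (hypothetical container, not the paper's format)."""
    label: str                       # e.g. "worker" or "excavator"
    confidence: float                # detector score in [0, 1]
    box: tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels


def build_prompt(detections: list[Detection], min_confidence: float = 0.5) -> str:
    """Turn detections into a text prompt for a vision-language model.

    Assumption: low-confidence detections are dropped and the rest are
    listed one per line; the paper may structure this differently.
    """
    kept = [d for d in detections if d.confidence >= min_confidence]
    lines = [
        f"{i}. {d.label} at box {d.box} (confidence {d.confidence:.2f})"
        for i, d in enumerate(kept, start=1)
    ]
    listing = "\n".join(lines) if lines else "none"
    return (
        "Detected entities:\n"
        f"{listing}\n"
        "Describe any safety hazards involving the entities listed above."
    )


# Stand-in detections; a real pipeline would obtain these from the detector.
example = [
    Detection("worker", 0.91, (10, 20, 50, 120)),
    Detection("excavator", 0.30, (200, 40, 400, 300)),
]
print(build_prompt(example))
```

Passing localized entities as text gives the small model explicit anchors in the scene, which is the intuition behind the reported reduction in hallucination; the exact prompt design would need to be taken from the full paper.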
