Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?
Xuezheng Chen, Zhengbo Zou

TL;DR
This paper introduces ConstructionSite 10k, a dataset of 10,000 annotated construction site images for evaluating vision language models (VLMs) in construction safety inspection. It highlights the models' zero-shot and few-shot capabilities and the need for further training.
Contribution
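The paper presents a new large-scale dataset of 10,000 construction site images annotated for three interconnected tasks (image captioning, safety rule violation visual question answering, and construction element visual grounding), enabling better evaluation and development of VLMs in construction safety.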
Findings
VLMs show strong zero-shot and few-shot generalization abilities.
Additional training improves VLM performance on construction safety tasks.
The dataset facilitates training and benchmarking of VLMs for construction safety inspection.
Abstract
Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable…
