Can Vision-Language Models Understand Construction Workers? An Exploratory Study
Hieu Bui, Nathaniel E. Chodosh, Arash Tavakoli

TL;DR
This study evaluates the ability of three leading vision-language models to recognize construction workers' actions and emotions from images, highlighting their potential and current limitations for construction site safety and monitoring.
Contribution
It provides a comparative analysis of GPT-4o, Florence 2, and LLaVa-1.5 in construction-related behavior recognition tasks using a curated dataset.
Findings
GPT-4o achieved the highest recognition scores.
Models struggled with semantically similar categories.
General-purpose VLMs offer baseline capabilities for construction monitoring.
Abstract
As robotics becomes increasingly integrated into construction workflows, its ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVa-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model's outputs through…
Taxonomy
Topics: Occupational Health and Safety Research · BIM and Construction Integration · Multimodal Machine Learning Applications
