Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems
Sanjita Prajapati, Tanu Singh, Chinmay Hegde, Pranamesh Chakraborty

TL;DR
This paper evaluates state-of-the-art vision language models for transportation engineering tasks like congestion detection, crack identification, and helmet violation detection, using zero-shot prompting to assess their performance without task-specific training.
Contribution
It provides a comprehensive comparison of vision language models (VLMs) on transportation tasks, highlighting their strengths and limitations to guide future development.
Findings
VLMs perform comparably to CNNs in image classification.
Object localization with VLMs still requires improvement.
Zero-shot prompting enables task execution without annotated datasets.
Abstract
Recent developments in vision language models (VLMs) have shown great potential for diverse applications related to image understanding. In this study, we explore state-of-the-art VLMs for vision-based transportation engineering tasks such as image classification and object detection. The image classification tasks are congestion detection and crack identification; the object detection task is identifying helmet violations. We apply open-source models such as CLIP, BLIP, OWL-ViT, and LLaVA-NeXT, along with the closed-source GPT-4o, to evaluate how well these models harness language understanding for vision-based transportation tasks. The tasks are performed with zero-shot prompting, in which a model carries out a task without any training on that task. It eliminates the need for…
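To illustrate the zero-shot idea described in the abstract, here is a minimal sketch of how a CLIP-style model could classify an image by comparing its embedding against embeddings of text prompts. It is not the authors' pipeline. The embeddings are toy vectors, the prompt wording is hypothetical, and the scaling factor of 100 is an illustrative assumption; a real setup would obtain the vectors from a pretrained image and text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, prompt_embeddings, scale=100.0):
    """Score each text prompt against the image; no task-specific training.

    `scale` is an assumed similarity scaling factor, not a value from the paper.
    """
    prompts = list(prompt_embeddings)
    logits = [scale * cosine(image_embedding, prompt_embeddings[p]) for p in prompts]
    return dict(zip(prompts, softmax(logits)))

# Toy embeddings standing in for encoder outputs (hypothetical values).
prompts = {
    "a photo of a congested road": [0.9, 0.1, 0.2],
    "a photo of a free-flowing road": [0.1, 0.9, 0.3],
}
image = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image, prompts)
best = max(probs, key=probs.get)
print(best)
```

Changing the task only requires changing the prompt strings (for example, to crack versus no-crack descriptions), which is why no annotated dataset is needed for this kind of classification.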
Taxonomy
Topics: BIM and Construction Integration · Safety Warnings and Signage
Methods: BLIP (Bootstrapping Language-Image Pre-training) · CLIP (Contrastive Language-Image Pre-training)
