Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Md Zahid Hasan, Jiajing Chen, Jiyang Wang, Mohammed Shaiqur Rahman, Ameya Joshi, Senem Velipasalar, Chinmay Hegde, Anuj Sharma, Soumik Sarkar

TL;DR
This paper introduces a CLIP-based framework for recognizing distracted driver behaviors from naturalistic videos, achieving state-of-the-art zero-shot and fine-tuned performance with limited annotated data.
Contribution
The paper presents a novel application of vision-language models such as CLIP to distracted driving detection, enabling effective zero-shot and few-shot learning from naturalistic driving videos.
Findings
State-of-the-art zero-shot performance on public datasets
Effective frame-based and video-based detection frameworks
Robust distracted activity classification with limited data
Abstract
Recognizing the activities causing distraction in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning like distracted driving activity recognition. Vision-language pretraining models, such as CLIP, have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based…
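The zero-shot idea referred to in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes image and text embeddings have already been produced by some CLIP-style encoder, and it replaces them with tiny hand-made vectors so the example runs without any model. The class prompts, the temperature value, the helper names, and the choice to average per-frame probabilities into a video-level prediction are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and every class prompt.

    The temperature is an assumed value, not one taken from the paper.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_video(frame_embs, text_embs, labels):
    """Average per-frame probabilities, then pick the most likely label.

    Mean pooling over frames is an assumption; other aggregation schemes exist.
    """
    per_frame = [zero_shot_probs(f, text_embs) for f in frame_embs]
    mean_probs = [sum(p[i] for p in per_frame) / len(per_frame)
                  for i in range(len(labels))]
    best = max(range(len(labels)), key=lambda i: mean_probs[i])
    return labels[best], mean_probs

# Hypothetical class prompts and toy 3-dimensional embeddings (illustration only).
labels = ["driving safely", "texting on the phone", "drinking"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_embs = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1], [0.9, 0.1, 0.1]]

if __name__ == "__main__":
    label, probs = classify_video(frame_embs, text_embs, labels)
    print(label, [round(p, 3) for p in probs])
```

In this sketch no labeled training data is used: adding a new behavior class only requires adding a new text prompt and its embedding. A real pipeline would obtain the embeddings from a pretrained vision-language model rather than from hand-written vectors.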
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Human Pose and Action Recognition · Human-Automation Interaction and Safety
Methods: Contrastive Language-Image Pre-training
