ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding
Liang Shi, Boyu Jiang, Tong Zeng, Feng Guo

TL;DR
ScVLM is a hybrid vision-language model that improves understanding and description of safety-critical traffic events, reducing hallucinations and enhancing accuracy for driver assistance systems.
Contribution
ScVLM introduces a hybrid training approach that combines supervised and contrastive learning to classify the severity and type of safety-critical events (SCEs) and to generate narrative descriptions of them with a vision-language model.
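As a rough illustration of what combining supervised and contrastive learning can look like, the sketch below adds a cross-entropy classification loss to an InfoNCE-style contrastive loss over paired embeddings. This is a generic pattern, not the paper's confirmed formulation: the function names, the video/narrative pairing, the `temperature` value, and the `weight` term are all assumptions made for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Supervised term: negative log-probability of the true class.
    return -math.log(softmax(logits)[label])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(video_embs, text_embs, temperature=0.07):
    # Contrastive term (assumed form): the i-th video embedding should be
    # most similar to the i-th narrative embedding in the batch.
    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in text_embs]
    loss = 0.0
    for i, vi in enumerate(v):
        sims = [dot(vi, tj) / temperature for tj in t]
        loss += cross_entropy(sims, i)
    return loss / len(v)

def hybrid_loss(class_logits, labels, video_embs, text_embs, weight=0.5):
    # Hypothetical weighted sum of the two terms; the weighting scheme
    # is an assumption, not taken from the paper.
    sup = sum(cross_entropy(l, y) for l, y in zip(class_logits, labels)) / len(labels)
    con = info_nce(video_embs, text_embs, temperature=0.07)
    return sup + weight * con
```

In this sketch, matched video and narrative embeddings yield a lower contrastive term than mismatched ones, so minimizing the sum encourages both correct class predictions and aligned pairs.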
Findings
Outperforms existing models in generating accurate SCE descriptions.
Reduces hallucinations in vision-language event understanding.
Validated on over 8,600 real-world traffic events.
Abstract
Accurately identifying, understanding and describing traffic safety-critical events (SCEs), including crashes, tire strikes, and near-crashes, is crucial for advanced driver assistance systems, automated driving systems, and traffic safety. As SCEs are rare events, most general vision-language models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucinations and missing key safety characteristics. Here, we introduce ScVLM, a novel hybrid methodology that integrates supervised and contrastive learning techniques to classify the severity and types of SCEs, as well as to generate narrative descriptions of SCEs. This approach utilizes classification to enhance VLMs' comprehension of driving videos and improve the rationality of event descriptions. The proposed approach is trained on and evaluated by more than 8,600 SCEs from the Second…
Taxonomy
Topics: Safety Warnings and Signage
Methods: Contrastive Learning
