INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation

Dianwei Chen, Zifan Zhang, Lei Cheng, Yuchen Liu, and Xianfeng Terry Yang

arXiv:2502.00262 · cs.CV · March 30, 2026

TL;DR

INSIGHT introduces a hierarchical vision-language model that fuses multimodal data to improve hazard detection and edge-case evaluation, significantly enhancing autonomous driving safety and robustness.

Contribution

The paper presents a novel VLM framework with multimodal data fusion and attention mechanisms, improving hazard localization and generalization in autonomous driving.

Findings

1. Significant improvement in hazard prediction accuracy on the BDD100K dataset.

2. Better generalization than existing models.

3. Robustness in complex real-world scenarios.

Abstract

Autonomous driving systems face significant challenges in handling unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models struggle with generalization to these rare events due to limitations in traditional detection and prediction approaches. To address this, we propose INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. By using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios and accurate forecasting of potential dangers. Through supervised fine-tuning of VLMs, we optimize spatial hazard localization using attention-based mechanisms…
