TL;DR
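This paper introduces DriveSOTIF, a method that fine-tunes multimodal large language models (MLLMs) on a dataset of perception-related SOTIF scenarios to improve perception safety in autonomous driving. It reports an 11.8% improvement in close-ended VQA accuracy, a 12.0% increase in open-ended VQA scores, and a 0.59-second average inference time per image.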
Contribution
The paper is presented as the first to apply domain-specific MLLM fine-tuning to perception-related SOTIF in autonomous driving, with the aim of improving hazard detection.
Findings
11.8% improvement in close-ended VQA accuracy over baseline models
12.0% increase in open-ended VQA scores over baseline models
Real-time inference, averaging 0.59 seconds per image
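The findings above come down to two kinds of measurement: an accuracy over close-ended questions and an average per-image inference time. The following is a minimal sketch of how such numbers could be computed, not the paper's evaluation code. The function names, the case-insensitive exact-match rule, and the `model_fn` callable are all hypothetical, and the paper's actual scoring protocol is not described in this summary.

```python
import time
from statistics import mean


def close_ended_accuracy(predictions, answers):
    """Fraction of predictions equal to the reference answer.

    Assumption: answers are compared by case-insensitive exact match
    after trimming whitespace. The paper may score differently.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def mean_latency_seconds(model_fn, images):
    """Average wall-clock seconds spent in model_fn per image.

    model_fn is a hypothetical callable standing in for one
    MLLM inference call on a single image.
    """
    durations = []
    for image in images:
        start = time.perf_counter()
        model_fn(image)
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Comparing a fine-tuned model with a baseline would then mean calling `close_ended_accuracy` once per model on the same question set and reporting the difference. Whether the reported 11.8% is an absolute or a relative difference is not stated in this summary.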
Abstract
Human drivers possess spatial and causal intelligence, enabling them to perceive driving scenarios, anticipate hazards, and react to dynamic environments. In contrast, autonomous vehicles lack these abilities, making it challenging to manage perception-related Safety of the Intended Functionality (SOTIF) risks, especially under complex or unpredictable driving conditions. To address this gap, we propose fine-tuning multimodal large language models (MLLMs) on a customized dataset specifically designed to capture perception-related SOTIF scenarios. Benchmarking results show that fine-tuned MLLMs achieve an 11.8% improvement in close-ended VQA accuracy and a 12.0% increase in open-ended VQA scores compared to baseline models, while maintaining real-time performance with a 0.59-second average inference time per image. We validate our approach through real-world case studies in Canada and…
