DriveSOTIF: Advancing Perception SOTIF Through Multimodal Large Language Models

Shucheng Huang; Freda Shi; Chen Sun; Jiaming Zhong; Minghao Ning; Yufeng Yang; Yukun Lu; Hong Wang; Amir Khajepour

arXiv:2505.07084 · cs.RO · September 12, 2025

TL;DR

This paper introduces DriveSOTIF, an approach that fine-tunes multimodal large language models (MLLMs) on a dataset built around perception-related SOTIF scenarios in autonomous driving. Compared to baseline models, the fine-tuned MLLMs score higher on both close-ended and open-ended VQA while averaging 0.59 seconds of inference per image.

Contribution

The paper is the first to apply domain-specific MLLM fine-tuning to perception SOTIF in autonomous driving, with the aim of improving hazard detection.

Findings

01 11.8% improvement in close-ended VQA accuracy over baseline models

02 12.0% increase in open-ended VQA scores over baseline models

03 Real-time inference at an average of 0.59 seconds per image
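As a rough illustration of how figures like those above could be computed, here is a minimal Python sketch. It is not the paper's evaluation code: the exact-match scoring rule, the function names, and the sample values are all assumptions made for illustration. The source does not say whether the reported improvements are absolute or relative, so both are shown.

```python
from statistics import mean


def closed_ended_accuracy(predictions, references):
    """Fraction of predictions equal to the reference answer.

    Assumption: answers are compared by case-insensitive exact match
    after trimming whitespace. The paper may score answers differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def absolute_gain(baseline, tuned):
    """Difference between the two scores (in the same unit as the inputs)."""
    return tuned - baseline


def relative_gain(baseline, tuned):
    """Gain expressed as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (tuned - baseline) / baseline


def mean_latency(seconds_per_image):
    """Average inference time per image, in seconds."""
    return mean(seconds_per_image)


if __name__ == "__main__":
    # Hypothetical toy data, not results from the paper.
    refs = ["yes", "no", "a"]
    baseline_preds = ["no", "no", "b"]
    tuned_preds = ["Yes", "no ", "b"]

    base_acc = closed_ended_accuracy(baseline_preds, refs)
    tuned_acc = closed_ended_accuracy(tuned_preds, refs)
    print("absolute gain:", absolute_gain(base_acc, tuned_acc))
    print("relative gain:", relative_gain(base_acc, tuned_acc))
    print("mean latency (s):", mean_latency([0.5, 0.7]))
```

Reporting the absolute and relative gain side by side avoids ambiguity when a summary quotes a single percentage.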

Abstract

Human drivers possess spatial and causal intelligence, enabling them to perceive driving scenarios, anticipate hazards, and react to dynamic environments. In contrast, autonomous vehicles lack these abilities, making it challenging to manage perception-related Safety of the Intended Functionality (SOTIF) risks, especially under complex or unpredictable driving conditions. To address this gap, we propose fine-tuning multimodal large language models (MLLMs) on a customized dataset specifically designed to capture perception-related SOTIF scenarios. Benchmarking results show that fine-tuned MLLMs achieve an 11.8% improvement in close-ended VQA accuracy and a 12.0% increase in open-ended VQA scores compared to baseline models, while maintaining real-time performance with a 0.59-second average inference time per image. We validate our approach through real-world case studies in Canada and…

Code & Models

Repositories

s95huang/drivesotif (official)