Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

Haruki Sakajo; Hiroshi Takato; Hiroshi Tsutsui; Komei Soda; Hidetaka Kamigaito; Taro Watanabe

arXiv:2511.23311 · cs.CV · December 1, 2025

Open Access

TL;DR

This paper explores the use of large-scale vision language models for generating safe driving instructions by analyzing synchronized road-facing and driver-facing videos, highlighting their potential and current limitations.

Contribution

The paper introduces a dataset and evaluates LVLM performance in safety-critical driving scenarios, emphasizing the importance of fine-tuning for improved accuracy.

Findings

01. Fine-tuned LVLMs produce more accurate safety-aware instructions.

02. Pre-trained LVLMs have limited effectiveness in driving safety tasks.

03. Challenges remain in detecting subtle or complex events.

Abstract

Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety also requires monitoring driver-facing views to detect risky events, such as mobile phone use while driving. Thus, models must be able to process synchronized inputs from both driver-facing and road-facing cameras. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on it. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning