Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

Tomaso Trinci; Henrique Piñeiro Monteagudo; Leonardo Taccari

arXiv:2605.22185·cs.CV·May 22, 2026

TL;DR

This paper presents a pipeline that enhances multimodal large language models for safety-critical driving video analysis by integrating video, telematics, and computer vision data to improve event detection.

Contribution

It introduces a novel data fusion and fine-tuning approach using DoRA adapters to improve safety-critical event identification in driving videos.
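DoRA (weight-decomposed low-rank adaptation) splits a frozen weight matrix into a per-column magnitude vector and a direction that receives a low-rank update. The sketch below illustrates that general formulation in plain Python, along with a per-layer count of trainable parameters. The matrix sizes, rank, and helper names are illustrative assumptions; this summary does not state which layers or rank the authors adapted.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_norms(w):
    """L2 norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def dora_weight(w0, b, a, m):
    """Adapted weight W' = m * (W0 + B A) / ||W0 + B A||_c (column-wise norm).

    w0 is frozen; b (d x r), a (r x k) and m (length k) are the trainable parts.
    """
    delta = matmul(b, a)
    v = [
        [w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]
    norms = column_norms(v)
    return [
        [m[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
        for i in range(len(v))
    ]


def trainable_params(d, k, r):
    """Trainable parameters for one adapted d x k layer: B, A, and magnitude m."""
    return d * r + r * k + k


# Hypothetical toy layer: 2 x 2 frozen weight, rank-1 adapter.
w0 = [[3.0, 0.0], [4.0, 2.0]]
b = [[0.0], [0.0]]            # B starts at zero, so the update B A is zero
a = [[0.5, -0.5]]
m = column_norms(w0)          # magnitude initialized to the column norms of W0
w_init = dora_weight(w0, b, a, m)
```

With `B` initialized to zero and `m` set to the column norms of `W0`, the adapted weight equals the frozen weight at the start of training, so fine-tuning begins from the pretrained model's behavior. Because only `B`, `A`, and `m` are trained, the parameter count grows with the rank rather than with the full layer size, which is how an adapter budget can stay small relative to the base model.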

Findings

01 Significant improvement in safety-critical event detection accuracy.

02 Efficient fine-tuning with fewer than 50M trainable parameters.

03 Effective generation of descriptive captions and QA pairs for training.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare, high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model…
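One way to picture the fusion step is to align the high-frequency telematics stream to the timestamps of the downsampled frames and summarize each window as text a model can read. The sketch below is a minimal illustration under assumed inputs: the sampling rates, window size, field names, and text format are hypothetical, and this summary does not describe the authors' actual synchronization or prompt format.

```python
from bisect import bisect_left, bisect_right


def summarize_imu_per_frame(frame_times, imu_times, imu_accel, half_window_s):
    """For each frame timestamp, summarize the IMU samples within +/- half_window_s.

    imu_times must be sorted ascending; imu_accel holds one reading per timestamp.
    """
    summaries = []
    for t in frame_times:
        lo = bisect_left(imu_times, t - half_window_s)
        hi = bisect_right(imu_times, t + half_window_s)
        window = imu_accel[lo:hi]
        if window:
            peak = max(window, key=abs)       # signed reading with largest magnitude
            mean = sum(window) / len(window)
        else:
            peak = mean = None                # no telematics near this frame
        summaries.append({"t": t, "peak": peak, "mean": mean, "n": len(window)})
    return summaries


def to_prompt_line(summary):
    """Render one per-frame summary as a short text line (hypothetical format)."""
    if summary["peak"] is None:
        return f"t={summary['t']:.1f}s peak_accel=n/a"
    return f"t={summary['t']:.1f}s peak_accel={summary['peak']:+.1f} m/s^2"


# Hypothetical data: 10 Hz longitudinal acceleration with one hard-braking sample,
# and video frames kept every 0.5 s.
imu_times = [i / 10 for i in range(20)]
imu_accel = [0.0] * 20
imu_accel[12] = -6.5
frame_times = [0.5, 1.0, 1.5]

summaries = summarize_imu_per_frame(frame_times, imu_times, imu_accel, 0.25)
lines = [to_prompt_line(s) for s in summaries]
```

Keeping the signed peak rather than the mean preserves short spikes, such as a hard-braking sample, that would otherwise be averaged away when the video is downsampled far below the sensor rate.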
