Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis
Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari

TL;DR
This paper presents a pipeline that enhances multimodal large language models for safety-critical driving video analysis by integrating video, telematics, and computer vision data to improve event detection.
Contribution
It introduces a novel data fusion and fine-tuning approach using DoRA adapters to improve safety-critical event identification in driving videos.
Findings
Significant improvement in safety-critical event detection accuracy.
Efficient fine-tuning with fewer than 50M trainable parameters.
Effective generation of descriptive captions and QA pairs for training.
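To make the parameter-efficiency finding concrete, here is a minimal sketch of how a DoRA adapter reparameterizes one frozen weight matrix, following the generally published DoRA formulation (a learned per-column magnitude times a normalized, low-rank-updated direction). It is not the authors' training code: the matrix shape, the rank, and the helper names `init_dora`, `dora_weight`, and `trainable_params` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def init_dora(w0, rank, rng):
    """Create DoRA parameters for a frozen weight matrix w0 of shape (d, k).

    m: per-column magnitude, initialized to the column norms of w0.
    b, a: low-rank factors; b starts at zero so the adapted weight equals w0.
    """
    d, k = w0.shape
    m = np.linalg.norm(w0, axis=0, keepdims=True)   # shape (1, k)
    b = np.zeros((d, rank))                          # shape (d, rank)
    a = rng.normal(scale=0.01, size=(rank, k))       # shape (rank, k)
    return m, b, a


def dora_weight(w0, m, b, a):
    """Adapted weight: magnitude m times the column-normalized direction."""
    v = w0 + b @ a                                   # low-rank update of the direction
    col_norm = np.linalg.norm(v, axis=0, keepdims=True)
    return m * v / col_norm


def trainable_params(d, k, rank):
    """Only m, b, and a are trained; w0 stays frozen."""
    return d * rank + rank * k + k


rng = np.random.default_rng(0)
w0 = rng.normal(size=(8, 6))                         # illustrative shape, not from the paper
m, b, a = init_dora(w0, rank=2, rng=rng)
w_adapted = dora_weight(w0, m, b, a)
```

Because `b` starts at zero, the adapted weight matches the frozen weight at initialization, and the trainable count grows with the rank rather than with the full `d * k` matrix size. That scaling is what allows an adapter budget to stay small relative to the base model.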
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare, high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model…
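The abstract does not spell out how the telematics stream is aligned with the downsampled frames, so the following is only a sketch of one plausible approach: for each kept frame, summarize the high-frequency IMU samples that fall in a small time window around it. The function name `summarize_imu_per_frame`, the window size, the millisecond timestamps, and the choice of summary statistics are all assumptions made for illustration, not details from the paper.

```python
from bisect import bisect_left, bisect_right
from statistics import mean


def summarize_imu_per_frame(frame_times_ms, imu_times_ms, imu_accel, half_window_ms):
    """Summarize IMU samples within +/- half_window_ms of each frame timestamp.

    imu_times_ms must be sorted ascending and aligned index-by-index with
    imu_accel (for example, longitudinal acceleration in m/s^2).
    """
    summaries = []
    for t in frame_times_ms:
        # Binary search for the slice of IMU samples inside the window.
        lo = bisect_left(imu_times_ms, t - half_window_ms)
        hi = bisect_right(imu_times_ms, t + half_window_ms)
        window = imu_accel[lo:hi]
        if window:
            summaries.append({
                "t_ms": t,
                "n": len(window),
                "mean_accel": mean(window),
                "peak_abs_accel": max(abs(x) for x in window),
            })
        else:
            # No telematics coverage for this frame.
            summaries.append({"t_ms": t, "n": 0, "mean_accel": None, "peak_abs_accel": None})
    return summaries


# Hypothetical 10 Hz IMU trace with a hard-braking spike around 400-500 ms.
imu_times_ms = list(range(0, 1000, 100))
imu_accel = [0.0, 0.0, 0.0, 0.0, -8.0, -6.0, 0.0, 0.0, 0.0, 0.0]
frame_summaries = summarize_imu_per_frame([0, 500], imu_times_ms, imu_accel, half_window_ms=100)
```

Per-frame summaries like these could then be rendered as text alongside each frame when prompting the model. Keeping the peak value matters because plain downsampling of the signal could miss a short braking or impact spike between frames.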
