Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Sanjeda Akter; Ibne Farabi Shihab; Anuj Sharma

arXiv:2507.02074 · cs.CV · September 10, 2025

TL;DR

This survey reviews how large language models and vision-language models are used for crash detection in videos, discussing methods, datasets, challenges, and future directions in intelligent transportation systems.

Contribution

The survey provides a structured taxonomy of fusion strategies, compares model architectures and performance benchmarks, and highlights open challenges in applying LLMs to video-based crash detection.

Findings

01. Fusion strategies for multimodal data analyzed

02. Key datasets summarized for crash detection

03. Performance benchmarks compared

Abstract

Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.
