TL;DR
CrashSight is a large-scale, infrastructure-centric video benchmark designed to evaluate vision-language models on traffic crash understanding, emphasizing temporal and causal reasoning in safety-critical scenarios.
Contribution
It introduces a dataset of real-world crash videos from roadside cameras, annotated with multiple-choice questions that assess scene understanding and reasoning, addressing the ego-vehicle focus of existing autonomous driving benchmarks.
Findings
Current VLMs perform poorly on temporal and causal reasoning in crash scenarios.
The dataset includes 13K multiple-choice questions across 250 crash videos, organized under a two-tier taxonomy: visual grounding (Tier 1) and higher-level reasoning (Tier 2).
Analysis reveals specific failure modes of state-of-the-art models in safety-critical contexts.
Abstract
Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show…
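To make the two-tier taxonomy concrete, here is a minimal sketch of how multiple-choice predictions could be scored per tier. It is an illustration only: the record fields (`video_id`, `category`, `answer_index`), the category keys, and the function name `accuracy_by_tier` are assumptions made for this example, not the benchmark's actual schema or evaluation code. Only the grouping of categories into Tier 1 and Tier 2 follows the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical category keys; the tier assignment mirrors the abstract's description.
TIER_OF_CATEGORY = {
    "scene_context": 1,
    "involved_parties": 1,
    "crash_mechanics": 2,
    "causal_attribution": 2,
    "temporal_progression": 2,
    "post_crash_outcomes": 2,
}


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice question about a crash video (assumed layout)."""
    video_id: str
    category: str
    question: str
    options: tuple
    answer_index: int  # index into `options` of the correct choice


def accuracy_by_tier(items, predictions):
    """Return {tier: accuracy} given one predicted option index per item."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted_index in zip(items, predictions):
        tier = TIER_OF_CATEGORY[item.category]
        total[tier] += 1
        correct[tier] += int(predicted_index == item.answer_index)
    return {tier: correct[tier] / total[tier] for tier in sorted(total)}
```

Reporting accuracy separately for each tier keeps grounding errors (Tier 1) distinguishable from reasoning errors (Tier 2), which a single pooled score would hide.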
