Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky
Xu Zhang, Mei Chen

TL;DR
This paper shows that fine-tuned transformer models classify secondary crash narratives more accurately than zero-shot LLMs and a logistic regression baseline, and it weighs computational cost and deployment practicality when using these models to improve crash data quality.
Contribution
It compares zero-shot open-source LLMs, fine-tuned transformers, and a logistic regression baseline for identifying secondary crashes from crash narratives, and finds fine-tuned transformers the most effective of the three.
Findings
Fine-tuned transformers performed best, with RoBERTa reaching the highest F1-score (0.90) and 95% accuracy.
Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference.
Mid-sized LLMs can match larger models' performance with less runtime.
Abstract
This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well…
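The abstract reports both F1-score and accuracy. Because only 3,803 of the 16,656 reviewed narratives (about 23%) are secondary crashes, the two metrics can diverge. The following is a minimal sketch of how they are computed from binary predictions, not the authors' evaluation code; the helper names and the toy labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary task where 1 = secondary crash, 0 = not secondary."""
    tp: int
    fp: int
    fn: int
    tn: int


def confusion(y_true, y_pred):
    """Tally true/false positives and negatives from paired labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def accuracy(c):
    """Share of all narratives labeled correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def f1_score(c):
    """Harmonic mean of precision and recall on the positive class."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 2 secondary crashes among 10 narratives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                        # never flags a secondary crash
one_hit_one_miss = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # one correct flag, one false alarm

acc_neg = accuracy(confusion(y_true, always_negative))
f1_neg = f1_score(confusion(y_true, always_negative))
acc_mix = accuracy(confusion(y_true, one_hit_one_miss))
f1_mix = f1_score(confusion(y_true, one_hit_one_miss))
```

In this toy case both predictors score the same accuracy, yet the one that never flags a secondary crash has an F1 of zero. That gap is why F1 on the positive class is a useful companion to accuracy when comparing classifiers on imbalanced narrative data.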
