RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding

Xi Xiao; Yunbei Zhang; Janet Wang; Lin Zhao; Yuxiang Wei; Hengjia Li; Yanshu Li; Xinyuan Song; Xiao Wang; Swalpa Kumar Roy; Hao Xu; and Tianyang Wang

arXiv:2507.17353·cs.CE·December 11, 2025

TL;DR

RoadBench introduces a multimodal benchmark with images and descriptions for road damage understanding, and RoadCLIP is a new vision-language model that leverages domain-specific enhancements to improve detection accuracy.

Contribution

This paper presents the first multimodal dataset and a novel vision-language model specifically designed for comprehensive road damage analysis.

Findings

1. RoadCLIP outperforms vision-only models by 19.2% in accuracy.
2. The dataset and model enable richer contextual understanding of road damages.
3. GPT-driven data augmentation significantly increases training data diversity.

Abstract

Accurate road damage detection is crucial for timely infrastructure maintenance and public safety, but existing vision-only datasets and models lack the rich contextual understanding that textual information can provide. To address this limitation, we introduce RoadBench, the first multimodal benchmark for comprehensive road damage understanding. This dataset pairs high-resolution images of road damage with detailed textual descriptions, providing richer context for model training. We also present RoadCLIP, a novel vision-language model that builds upon CLIP by integrating domain-specific enhancements. It includes a disease-aware positional encoding that captures spatial patterns of road defects and a mechanism for injecting road-condition priors to refine the model's understanding of road damage. We further employ a GPT-driven data generation pipeline to expand the image to text…
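For readers unfamiliar with the CLIP family that RoadCLIP builds on, the following is a minimal sketch of the symmetric image-text contrastive objective used by the original CLIP. The abstract as shown here does not state RoadCLIP's training objective, so this should be read as background on CLIP rather than as the authors' method. The function names, the toy embeddings, and the fixed temperature of 0.07 are all illustrative assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Assumption: row i of image_embs and row i of text_embs form a matched
    pair; every other combination in the batch is treated as a negative.
    """
    if not image_embs or len(image_embs) != len(text_embs):
        raise ValueError("need equally sized, non-empty batches")
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should select its own caption.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should select its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: three hypothetical image/text embedding pairs.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_texts = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

aligned_loss = clip_contrastive_loss(images, aligned_texts)
shuffled_loss = clip_contrastive_loss(images, shuffled_texts)
```

In this toy batch the correctly paired captions yield a lower loss than the shuffled ones, which is the signal a CLIP-style model is trained to minimize. Domain-specific additions such as the positional encoding and road-condition priors named in the abstract would sit in the encoders that produce these embeddings and are not modeled here.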

Taxonomy

Topics: Infrastructure Maintenance and Monitoring · Fire Detection and Safety Systems