# Multimodal generative AI for automated pavement condition assessment: Benchmarking model performance

**Authors:** Chang Xu, Lei Shu, Anh Dao, Yue Cui, Junghwan Kim

PMC · DOI: 10.1371/journal.pone.0340380 · PLOS One · 2026-02-12

## TL;DR

This paper benchmarks seven multimodal large language models (MLLMs) for assessing pavement condition from street-level images and recommends GPT-4o as the model that best balances responsiveness, accuracy, and computational cost.

## Contribution

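The study introduces a benchmark for multimodal AI models in pavement condition assessment, comparing three proprietary models (Gemini 2.5 Pro, OpenAI o1, and GPT-4o) with four open-source models (Gemma 3, Llama 3.2, LLaVA v1.6 Mistral, and LLaVA v1.6 Vicuna).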

## Key findings

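- MLLMs can interpret street-level imagery and generate task-relevant outputs for pavement condition assessment in a cost-effective manner.
- GPT-4o is the recommended model because it balances responsiveness, accuracy, and computational cost.
- Performance was evaluated across four tasks (distress and feature identification, spatial pattern recognition, severity evaluation, and maintenance interval estimation) and five dimensions (response rate, response correctness, consistency, multimodal errors, and overall computational intensity and cost).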

## Abstract

Accurate and efficient pavement condition assessment is essential for maintaining roadway safety and optimizing maintenance investments. However, conventional assessment methods such as manual visual inspections and specialized sensing equipment are often time-consuming, expensive, and difficult to scale across large networks. Recent advancements in generative artificial intelligence (GAI) have introduced new opportunities for automating visual interpretation tasks using street-level imagery. This study evaluates the performance of seven multimodal large language models (MLLMs) for road surface condition assessment, including three proprietary models (Gemini 2.5 Pro, OpenAI o1, and GPT-4o) and four open-source models (Gemma 3, Llama 3.2, LLaVA v1.6 Mistral, and LLaVA v1.6 Vicuna). The models were tested across four task categories relevant to pavement management: distress and feature identification, spatial pattern recognition, severity evaluation, and maintenance interval estimation. Model performance was assessed across five dimensions: response rate, response correctness, consistency, multimodal errors, and overall computational intensity and cost. Results indicate that MLLMs can interpret street-level imagery and generate task-relevant outputs in a cost-effective manner. Among the evaluated models, we recommend GPT-4o as the preferred option, as it balances responsiveness, accuracy, and computational cost.
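Three of the five evaluation dimensions named in the abstract (response rate, response correctness, and consistency) can be illustrated with a small scoring sketch. The definitions below are assumptions made for illustration only, since this summary does not give the paper's exact formulas. The `Trial` record, the function names, and the sample labels are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    """One model response to one image/question pair (hypothetical record)."""
    question_id: str
    answer: Optional[str]  # None means the model gave no usable response
    expected: str          # reference label, assumed to come from a human rater


def response_rate(trials):
    """Share of trials in which the model returned any answer."""
    return sum(t.answer is not None for t in trials) / len(trials)


def correctness(trials):
    """Share of answered trials whose answer matches the reference label."""
    answered = [t for t in trials if t.answer is not None]
    if not answered:
        return 0.0
    return sum(t.answer == t.expected for t in answered) / len(answered)


def consistency(trials):
    """Mean, over questions, of the share of answers equal to the modal answer."""
    by_question = {}
    for t in trials:
        if t.answer is not None:
            by_question.setdefault(t.question_id, []).append(t.answer)
    if not by_question:
        return 0.0
    shares = [
        Counter(answers).most_common(1)[0][1] / len(answers)
        for answers in by_question.values()
    ]
    return sum(shares) / len(shares)


# Made-up sample data: three repeated prompts for q1, two for q2.
trials = [
    Trial("q1", "crack", "crack"),
    Trial("q1", "crack", "crack"),
    Trial("q1", "pothole", "crack"),
    Trial("q2", None, "severe"),
    Trial("q2", "severe", "severe"),
]

print(response_rate(trials), correctness(trials), consistency(trials))
```

Scoring correctness only over answered trials keeps it separate from response rate, so a model that often declines to answer is not penalized twice. This is a design choice of the sketch, not a claim about the paper's method.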

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12900301/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12900301/full.md

## References

86 references — full list in the complete paper: https://tomesphere.com/paper/PMC12900301/full.md

---
Source: https://tomesphere.com/paper/PMC12900301