WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

Bingnan Liu; Chenhang Cui; Rui Huang; Jiani Luo; Zhirong Shen; Tinghao Wang; Xiande Huang; Lingbei Meng; Fei Shen; An Zhang

arXiv:2605.20306·cs.CV·May 21, 2026

TL;DR

WildRoadBench is a UAV road-damage grounding benchmark that evaluates vision-language models and LLM-driven autonomous agents on the same annotated image set with the same per-class AP_50 metric, and it shows that both still fall well short on this task.

Contribution

The paper presents a benchmark that couples direct visual grounding by vision-language models with autonomous agent tasks on a single UAV dataset, defines an evaluation protocol for each track, and reports baseline results for closed-source and open-source models.

Findings

01. Closed-source models outperform open-source ones but still leave significant room for improvement.

02. Open-source models struggle with small targets and reasoning tasks.

03. Autonomous agents underperform vision-language models and often fail to submit valid predictions within the given constraints.

Abstract

We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source…
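
Both tracks are scored with a per-class AP_50, so a small sketch of that metric may help make the protocol concrete. The code below is an illustrative approximation and not the benchmark's official scorer: the function names `iou` and `ap50`, the `(x1, y1, x2, y2)` box format, the greedy matching rule, and the all-point precision envelope are assumptions, since the excerpt does not specify the exact AP variant used.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ap50(
    preds: List[Tuple[str, float, Box]],  # (image_id, score, box) for ONE class
    gts: Dict[str, List[Box]],            # image_id -> ground-truth boxes for that class
    thr: float = 0.5,
) -> float:
    """Average precision at IoU >= thr for a single class (hypothetical helper)."""
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0

    used = {img: [False] * len(boxes) for img, boxes in gts.items()}
    hits: List[bool] = []

    # Visit predictions from most to least confident.
    for img, _score, box in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts.get(img, [])):
            if used[img][j]:
                continue  # simplification: each ground-truth box is matched at most once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= thr:
            used[img][best_j] = True
            hits.append(True)   # true positive
        else:
            hits.append(False)  # false positive

    # Cumulative precision and recall after each prediction.
    tp = 0
    precisions: List[float] = []
    recalls: List[float] = []
    for k, hit in enumerate(hits, start=1):
        tp += int(hit)
        precisions.append(tp / k)
        recalls.append(tp / n_gt)

    # Make precision monotonically non-increasing (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

Under these assumptions, a per-class report would call `ap50` once per damage class with that class's predictions and ground truth. Official scorers may differ in details such as how duplicate detections or tied scores are handled.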

Code & Models

Repositories

https://anonymous.4open.science/r/wildroadbench-0607
