Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure

Luxuan Fu; Chong Liu; Bisheng Yang; Zhen Dong

arXiv:2601.10551·cs.CV·January 16, 2026

Open Access

TL;DR

This paper presents a domain-adapted framework that enhances large vision-language models for accurate, standards-compliant perception of roadside infrastructure in smart city applications, combining fine-tuning, reasoning, and retrieval techniques.

Contribution

It introduces a novel domain-specific adaptation of VLMs using data-efficient fine-tuning, knowledge-grounded reasoning, and retrieval-augmented generation for infrastructure analysis.

Findings

01. Achieved 58.9 mAP in asset detection
02. Attained 95.5% accuracy in attribute recognition
03. Demonstrated robustness on a new urban roadside dataset
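
For readers unfamiliar with the detection metric quoted above, the following is a minimal sketch of how average precision (AP) is commonly computed for one class; mAP is then the mean of AP over classes (and, in some protocols, over several IoU thresholds). The IoU threshold of 0.5, the greedy matching rule, and the function names `iou` and `average_precision` are assumptions made for illustration; this listing does not state which evaluation protocol the paper uses.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """AP for a single class (illustrative, not the paper's protocol).

    detections: list of (score, box); ground_truth: list of boxes.
    """
    if not ground_truth:
        return 0.0
    matched = set()
    hits = []  # 1 = true positive, 0 = false positive, in descending score order
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth box may be matched only once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou >= iou_threshold:
            matched.add(best_idx)
            hits.append(1)
        else:
            hits.append(0)

    # Precision and recall after each ranked detection.
    tp = 0
    precisions, recalls = [], []
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / len(ground_truth))

    # Make precision non-increasing as recall grows (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy example with made-up boxes: two objects, three detections, one spurious.
toy_gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
toy_dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
toy_ap = average_precision(toy_dets, toy_gt)
```

On this scale a perfect detector scores 1.0, so a reported value such as 58.9 is this quantity expressed as a percentage and averaged over classes.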

Abstract

Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often fail to capture the necessary fine-grained attributes and domain rules. While Large Vision-Language Models (VLMs) excel at open-world recognition, they struggle to interpret complex facility states accurately and in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic…
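
The abstract mentions LoRA-based adaptation. As a rough illustration of why that is considered data- and parameter-efficient, the sketch below shows the general low-rank update from the original LoRA formulation: a frozen weight W is augmented with a trainable product B·A scaled by alpha / r. It uses plain Python lists rather than a deep learning library, and the dimensions, rank, `alpha` value, and helper names (`matmul`, `init_lora`, `lora_weight`) are hypothetical choices for this example; the listing does not give the paper's actual rank, target layers, or hyperparameters.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, r, seed=0):
    """A gets small random values, B starts at zero, so the initial update is zero."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, with W kept frozen."""
    r = len(a)  # rank = number of rows of A
    delta = matmul(b, a)  # shape (d_out, d_in), rank at most r
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical sizes, chosen only to keep the example small.
d_out, d_in, r = 8, 8, 2
w = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]
a, b = init_lora(d_out, d_in, r)
w_eff = lora_weight(w, a, b, alpha=4.0)

trainable_params = r * (d_in + d_out)  # entries of A plus entries of B
full_params = d_in * d_out             # entries of W if it were fine-tuned directly
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and only the entries of A and B are trained. The saving grows with layer size: for a square layer of width d, the adapter has 2·r·d parameters against d² for full fine-tuning.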

Taxonomy

Topics: Infrastructure Maintenance and Monitoring · Multimodal Machine Learning Applications · Advanced Neural Network Applications