Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework

Hao Chang; Zhihui Wang; Lingxiang Wu; Wei An; Boyang Li; Zaiping Lin; Weidong Sheng; Jinqiao Wang

arXiv:2601.19640 · cs.CV · April 9, 2026

TL;DR

This paper introduces GovLA-10K, a management-focused multi-modal benchmark, and GovLA-Reasoner, a vision-language reasoning framework for urban governance. Both emphasize salient targets and spatially-aware grounding.

Contribution

It presents the first management-oriented benchmark and a novel spatially-aware adapter for integrated visual and language reasoning in low-altitude governance systems.

Findings

01. GovLA-Reasoner improves performance without task-specific fine-tuning.

02. The benchmark focuses on management-relevant targets for urban governance.

03. The Spatially-aware Grounding Adapter effectively integrates spatial cues into reasoning.

Abstract

Low-altitude vision systems are becoming critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines still struggle to support the management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and it further provides actionable management suggestions grounded in these observations. To effectively coordinate the fine-grained visual…
