Focus on What Really Matters in Low-Altitude Governance: A Management-Centric Multi-Modal Benchmark with Implicitly Coordinated Vision-Language Reasoning Framework
Hao Chang, Zhihui Wang, Lingxiang Wu, Wei An, Boyang Li, Zaiping Lin, Weidong Sheng, Jinqiao Wang

TL;DR
This paper introduces GovLA-10K, a management-focused multi-modal benchmark, and GovLA-Reasoner, a vision-language reasoning framework designed for urban governance. The work emphasizes salient targets and spatially-aware grounding.
Contribution
It presents the first management-oriented benchmark and a novel spatially-aware adapter for integrated visual and language reasoning in low-altitude governance systems.
Findings
GovLA-Reasoner improves performance without task-specific fine-tuning.
The benchmark focuses on management-relevant targets for urban governance.
The Spatially-aware Grounding Adapter effectively integrates spatial cues into reasoning.
Abstract
Low-altitude vision systems are becoming critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines still struggle to support the management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and it further provides actionable management suggestions grounded in these observations. To effectively coordinate the fine-grained visual…
