VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, Bo Du

TL;DR
VLRS-Bench is a benchmark for evaluating how well multimodal large language models (MLLMs) handle complex reasoning in remote sensing (RS), addressing the perception-heavy bias of existing RS benchmarks.
Contribution
The paper introduces what the authors describe as the first benchmark exclusively dedicated to complex RS reasoning: 2,000 question-answer pairs spanning 14 tasks and up to eight temporal phases, constructed with RS-specific priors and expert knowledge.
Findings
Existing MLLMs show significant bottlenecks in complex RS reasoning.
VLRS-Bench covers 14 tasks and up to eight temporal phases.
The benchmark highlights critical gaps in current multimodal reasoning capabilities.
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average question length of 130.19 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity.…
