Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation
Xizhe Xue, Xiao Xiang Zhu

TL;DR
This paper introduces REO-Instruct, a novel benchmark dataset for evaluating vision language models on both descriptive and regression tasks in Earth Observation, specifically focusing on forest ecological analysis.
Contribution
It presents the first unified benchmark for EO that combines qualitative understanding with quantitative biophysical variable prediction, bridging perception and scientific inference.
Findings
Current VLMs struggle with numeric reasoning in EO tasks.
REO-Instruct provides a standardized platform for developing geospatial models.
Baseline evaluations reveal significant challenges in scientific regression tasks.
Abstract
Recent progress in vision language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in a forest ecological scenario (human activity, land-cover classification, ecological patch counting, above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual…
Taxonomy
Topics: Multimodal Machine Learning Applications · Remote-Sensing Image Classification · Geographic Information Systems Studies
