EarthVL: A Progressive Earth Vision-Language Understanding and Generation Framework
Junjue Wang, Yanfei Zhong, Zihang Chen, Zhuo Zheng, Ailong Ma, Liangpei Zhang

TL;DR
EarthVL introduces a comprehensive framework combining a new dataset and a multi-stage model for advanced Earth scene understanding and generation, focusing on geospatial object reasoning and city planning applications.
Contribution
The paper presents EarthVLSet, a multi-task dataset, and EarthVLNet, a progressive semantic-guided network, for Earth vision-language tasks, emphasizing object-relational reasoning and scene comprehension.
Findings
EarthVLNet outperforms existing methods on semantic segmentation and VQA benchmarks.
Segmentation features improve VQA performance across datasets.
Open-ended VQA tasks require more advanced vision and language models.
Abstract
Earth vision has achieved milestones in geospatial object recognition but lacks exploration in object-relational reasoning, limiting comprehensive scene understanding. To address this, a progressive Earth vision-language understanding and generation framework is proposed, including a multi-task dataset (EarthVLSet) and a semantic-guided network (EarthVLNet). Focusing on city planning applications, EarthVLSet includes 10.9k sub-meter resolution remote sensing images, land-cover masks, and 761.5k textual pairs involving both multiple-choice and open-ended visual question answering (VQA) tasks. In an object-centric way, EarthVLNet is proposed to progressively achieve semantic segmentation, relational reasoning, and comprehensive understanding. The first stage involves land-cover segmentation to generate object semantics for VQA guidance. Guided by pixel-wise semantics, the object awareness…
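The progression the abstract describes (segment first, then let the resulting object semantics guide question answering) can be illustrated with a toy sketch. This is not EarthVLNet: the class names, the area-fraction summary, and the rule for picking an answer are all assumptions made for illustration, and a hand-written mask stands in for a learned segmentation output.

```python
from collections import Counter

# Hypothetical class ids and names; the abstract does not list EarthVLSet's classes.
CLASS_NAMES = {0: "background", 1: "building", 2: "road", 3: "water"}


def class_fractions(mask):
    """Stage 1 stand-in: summarize a land-cover mask as per-class area fractions."""
    counts = Counter(pixel for row in mask for pixel in row)
    total = sum(counts.values())
    return {CLASS_NAMES[c]: n / total for c, n in counts.items()}


def answer_multiple_choice(fractions, choices):
    """Stage 2 stand-in: pick the candidate class that covers the most area."""
    return max(choices, key=lambda name: fractions.get(name, 0.0))


# A tiny hand-written mask in place of a real segmentation prediction.
mask = [
    [1, 1, 2, 0],
    [1, 1, 2, 0],
    [3, 3, 2, 0],
]

fractions = class_fractions(mask)
# Toy question: "Which land-cover class occupies the largest area?"
answer = answer_multiple_choice(fractions, ["building", "road", "water"])
print(fractions, answer)
```

The point of the sketch is only the data flow: the question is answered from the pixel-wise semantics rather than from the raw image, which is the sense in which segmentation guides VQA here.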
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Remote-Sensing Image Classification
