GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang

TL;DR
This paper introduces GeoLLaVA-8K, a multimodal large language model for ultra-high-resolution remote sensing imagery, overcoming data scarcity and token explosion issues to achieve state-of-the-art performance on 8K images.
Contribution
It presents the first RS-focused multimodal LLM capable of 8K resolution, with new high-res datasets and token pruning strategies to handle large image inputs effectively.
Findings
Achieved state-of-the-art results on XLRS-Bench with 8K imagery.
Developed SuperRS-VQA and HighRS-VQA datasets for remote sensing.
Proposed token pruning techniques to reduce memory use while maintaining accuracy.
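The general idea behind token pruning can be illustrated with a minimal sketch. This is not the paper's Background Token Pruning or Anchored Token Selection; it is a generic score-based top-k filter. The function name `prune_tokens`, the per-token `scores`, and the `keep_ratio` parameter are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring tokens, in original order.

    `scores` is an assumed per-token importance value (for example, how
    object-centric a visual token is); how to compute it is out of scope here.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not scores:
        return []
    # Keep at least one token so the downstream model always has input.
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original spatial order of the surviving tokens.
    return sorted(ranked[:n_keep])


# Hypothetical scores: tokens 1 and 3 look object-centric, 0 and 2 look like background.
kept = prune_tokens([0.1, 0.9, 0.05, 0.8], keep_ratio=0.5)
```

Sorting the kept indices back into their original order preserves the spatial sequence of the surviving tokens, which a language model consuming them would typically expect.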
Abstract
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376×8,376) and HighRS-VQA (avg. 2,000×1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the…
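To give a sense of scale for the token explosion the abstract mentions, the following back-of-the-envelope sketch counts the visual tokens a ViT-style encoder would emit if it split the image into non-overlapping square patches with no downsampling. The patch size of 14 pixels and the helper name `patch_token_count` are assumptions for illustration; the abstract does not state the encoder configuration.

```python
import math


def patch_token_count(width: int, height: int, patch: int = 14) -> int:
    """Number of non-overlapping patch tokens for an image of the given size.

    Assumes (hypothetically) one token per `patch` x `patch` tile, with the
    image padded up to a whole number of tiles along each axis.
    """
    if width <= 0 or height <= 0 or patch <= 0:
        raise ValueError("width, height, and patch must be positive")
    return math.ceil(width / patch) * math.ceil(height / patch)


# Average image sizes quoted in the abstract.
super_rs = patch_token_count(8376, 8376)   # SuperRS-VQA average
high_rs = patch_token_count(2000, 1912)    # HighRS-VQA average
baseline = patch_token_count(336, 336)     # a common low-resolution input size

print(super_rs, high_rs, baseline)
print(f"8K image vs 336px baseline: {super_rs / baseline:.0f}x more tokens")
```

Under these assumptions, a single image at the SuperRS-VQA average size yields several hundred thousand tokens, hundreds of times more than a 336-pixel input, which is why discarding redundant background tokens matters.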
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Pruning
