GeoChat: Grounded Large Vision-Language Model for Remote Sensing
Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, Fahad Shahbaz Khan

TL;DR
GeoChat is a grounded large vision-language model designed for remote sensing (RS). It supports multitask conversations, region-specific dialogue, and object grounding on high-resolution RS images.
Contribution
It introduces the first versatile remote sensing VLM with multitask conversational abilities and a new RS multimodal instruction-following dataset, along with a comprehensive benchmark.
Findings
Robust zero-shot performance on RS tasks
Effective region-specific dialogue and object grounding
Outperforms baseline methods in RS multimodal understanding
Abstract
Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content. However, such general-domain VLMs perform poorly for Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries. Such behavior emerges due to the unique challenges introduced by RS imagery. For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation. Furthermore, the lack of domain-specific multimodal instruction-following data as well as strong backbone models for RS makes it hard for the models to align their behavior with user queries. To address these limitations, we propose GeoChat - the first versatile remote…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: ALIGN
