Large Vision-Language Models for Remote Sensing Visual Question Answering

Surasakdi Siripong; Apirak Chaiyapan; Thanakorn Phonchai

arXiv:2411.10857 · cs.CV · November 19, 2024 · 2 cites

Open Access

TL;DR

This paper introduces a novel large vision-language model for remote sensing visual question answering, enabling more accurate and fluent natural language responses from satellite imagery without predefined answer categories.

Contribution

The paper presents a two-step training strategy (domain-adaptive pretraining followed by prompt-based finetuning) for a generative LVLM tailored to remote sensing, improving over traditional methods in accuracy and relevance.

Findings

1. Outperforms state-of-the-art baselines on the RSVQAxBEN dataset
2. Produces more accurate, relevant, and fluent answers according to human evaluation
3. Demonstrates the effectiveness of generative LVLMs in remote sensing analysis

Abstract

Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions. Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions. In this paper, we propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process. Our approach consists of a two-step training strategy: domain-adaptive pretraining and prompt-based finetuning. This method enables the LVLM to generate natural language answers by conditioning on both visual and textual inputs, without the need for predefined answer categories. We evaluate our model on the RSVQAxBEN dataset, demonstrating superior performance compared to state-of-the-art…
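To make the idea of prompt-based finetuning without predefined answer categories more concrete, the following is a minimal sketch in plain Python. It only illustrates how an image-question-answer triple could be turned into a free-text prompt and target, and how a generated answer could be compared to a reference. The prompt template, the `<image>` placeholder, the `VQAExample` class, and the helper names are all assumptions made for illustration; the abstract does not specify the authors' actual prompt format, model, or evaluation code.

```python
from dataclasses import dataclass

# Hypothetical prompt template (an assumption; the paper's wording is not given).
# "<image>" stands in for wherever the model would receive visual features.
PROMPT_TEMPLATE = "<image>\nQuestion: {question}\nAnswer:"


@dataclass
class VQAExample:
    """One remote sensing VQA sample (illustrative structure only)."""
    image_path: str
    question: str
    answer: str


def build_finetuning_pair(example):
    """Return (prompt, target) strings for generative finetuning.

    The target is free text, not an index into a fixed answer vocabulary,
    which is what distinguishes the generative setup from a classifier.
    """
    prompt = PROMPT_TEMPLATE.format(question=example.question.strip())
    target = " " + example.answer.strip()
    return prompt, target


def normalize(text):
    """Lowercase, drop a trailing period, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def exact_match(prediction, reference):
    """Compare a generated answer to the reference after normalization."""
    return normalize(prediction) == normalize(reference)


if __name__ == "__main__":
    sample = VQAExample("tile_0001.tif", "Is there a water body in the image?", "Yes")
    prompt, target = build_finetuning_pair(sample)
    print(prompt + target)
    print(exact_match("Yes.", sample.answer))
```

Because answers are generated as text, some normalization before scoring is usually needed; exact match is used here only as the simplest possible stand-in, and the paper's own evaluation also relies on human judgment.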

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques