Questions beyond Pixels: Integrating Commonsense Knowledge in Visual Question Generation for Remote Sensing

Siran Li; Li Mi; Javiera Castillo-Navarro; Devis Tuia

arXiv:2602.19217·cs.CV·February 24, 2026

Open Access

TL;DR

This paper introduces KRSVQG, a knowledge-aware model for generating diverse, grounded questions about remote sensing images by integrating external commonsense knowledge and employing vision-language pre-training.

Contribution

The paper presents a novel KRSVQG model that incorporates external knowledge triplets and uses captioning as an intermediary, improving question diversity and grounding in remote sensing images.
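As a rough illustration of the two ingredients named above (knowledge triplets and a caption used as an intermediary), the following minimal sketch pairs a caption with matching triplets and serializes them into text that a question generator could consume. It is not the authors' implementation: the `Triplet` class, the `TOY_KB` entries, the relation names, the word-matching retrieval rule, and the `caption: ... knowledge: ...` format are all hypothetical, and the captioning and question-generation models are omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """A (head, relation, tail) knowledge triplet."""
    head: str
    relation: str
    tail: str


# Hypothetical toy knowledge base; entries are illustrative, not from the paper.
TOY_KB = [
    Triplet("airport", "UsedFor", "boarding planes"),
    Triplet("bridge", "UsedFor", "crossing a river"),
    Triplet("stadium", "UsedFor", "watching sports"),
]


def retrieve_triplets(caption, kb):
    """Return triplets whose head entity appears as a word in the caption.

    Assumption: simple word matching stands in for whatever retrieval
    the real model uses.
    """
    words = set(caption.lower().replace(".", " ").split())
    return [t for t in kb if t.head in words]


def build_generator_input(caption, triplet):
    """Serialize a caption and one triplet into a single text input."""
    return (
        f"caption: {caption} "
        f"knowledge: {triplet.head} {triplet.relation} {triplet.tail}"
    )


# In a full pipeline the caption would come from an image captioning model.
caption = "A bridge spans a wide river next to a stadium."
matches = retrieve_triplets(caption, TOY_KB)
inputs = [build_generator_input(caption, t) for t in matches]

for text in inputs:
    print(text)
```

Each resulting string ties one piece of external knowledge to the image description, which is the sense in which a caption can ground a knowledge-enriched question in the image.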

Findings

01

KRSVQG outperforms existing methods on the reported evaluation metrics and in human assessments.

02

The authors constructed two new datasets, NWPU-300 and TextRS-300.

03

Generated questions are richer and better grounded in image and domain knowledge.

Abstract

With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing semantic image retrieval. However, current automatically generated questions tend to be simplistic and template-based, which hinders the deployment of question answering or visual dialogue systems for real-world applications. To enrich and diversify the questions with both image content and commonsense knowledge, we propose a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG). The proposed model incorporates related knowledge triplets from external knowledge sources to broaden the question content, while employing image captioning as an intermediary representation to ground questions to the corresponding images. Moreover, KRSVQG utilizes a vision-language pre-training and fine-tuning strategy,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling