From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data

Zhenghang Yuan, Lichao Mou, Qi Wang, and Xiao Xiang Zhu

arXiv:2205.03147·cs.CV·June 15, 2022

TL;DR

This paper introduces a novel curriculum learning approach for remote sensing visual question answering, combining multi-level visual features and a language-guided self-paced training strategy to improve model performance.

Contribution

The paper proposes a multi-level visual feature extraction method and a language-guided self-paced curriculum learning framework for RSVQA, addressing two challenges: the lack of object annotations in RSVQA datasets and the varying difficulty of the questions asked about each image.

Findings

01. Achieves promising performance on three public datasets.

02. Effectively handles questions with varying difficulty levels.

03. Improves model robustness without object annotations.

Abstract

Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems. Although VQA in computer vision has been widely researched, VQA for remote sensing data (RSVQA) is still in its infancy. Two characteristics of the RSVQA task need special consideration: 1) no object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations; 2) the questions for each image have clearly different difficulty levels. Directly training a model with questions in a random order may confuse the model and limit its performance. To address these two problems, this paper proposes a multi-level visual feature learning method that jointly extracts language-guided holistic and regional image features. Besides, a self-paced curriculum learning…
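
The easy-to-hard idea in the abstract can be illustrated with a generic self-paced weighting sketch. This is not the authors' implementation: the abstract is truncated before it describes their strategy, so the linear soft-weighting rule, the geometric threshold schedule, and every name below (`self_paced_weights`, `weighted_loss`, `threshold_schedule`, the sample losses) are assumptions chosen for illustration. The language-guided part of the method is not modeled here.

```python
def self_paced_weights(losses, threshold):
    """Linear soft weighting (assumed rule): weight is 1 at zero loss
    and falls linearly to 0 once the loss reaches `threshold`."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return [max(0.0, 1.0 - loss / threshold) for loss in losses]


def weighted_loss(losses, weights):
    """Weighted mean of per-sample losses; 0.0 when every weight is zero."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total


def threshold_schedule(start, growth, epoch):
    """Hypothetical schedule: grow the threshold geometrically per epoch
    so that harder (higher-loss) samples gain weight later in training."""
    return start * (growth ** epoch)


# Hypothetical per-question losses, ordered from easy to hard.
losses = [0.2, 0.9, 2.5]
for epoch in range(3):
    lam = threshold_schedule(start=1.0, growth=2.0, epoch=epoch)
    weights = self_paced_weights(losses, lam)
    print(epoch, lam, [round(w, 3) for w in weights])
```

In this sketch the hardest question receives zero weight in the first epoch and contributes more as the threshold grows, which is the general behaviour a self-paced curriculum aims for.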
