Continual VQA for Disaster Response Systems
Aditya Kane, V Manushree, Sahil Khose

TL;DR
This paper introduces a continual visual question answering system for disaster response that leverages pre-trained CLIP embeddings and experience replay to improve performance and mitigate catastrophic forgetting in real-life scenarios.
Contribution
It presents a novel continual VQA approach using CLIP embeddings and experience replay, surpassing previous methods on the FloodNet dataset.
Findings
Supervised training with CLIP embeddings improves VQA accuracy.
Continual learning methods reduce catastrophic forgetting.
The approach achieves state-of-the-art results on the FloodNet dataset.
Abstract
Visual Question Answering (VQA) is a multi-modal task that involves answering questions about an input image, which requires semantically understanding the contents of the image and responding in natural language. Using VQA for disaster management is an important line of research because of the range of problems a VQA system can address. However, the main challenge is the delay caused by the generation of labels in the assessment of the affected areas. To tackle this, we deployed a pre-trained CLIP model, which is trained on image-text pairs. However, we empirically see that the model has poor zero-shot performance. Thus, we instead use pre-trained embeddings of text and image from this model for our supervised training and surpass previous state-of-the-art results on the FloodNet dataset. We expand this to a continual setting, which is a more real-life scenario. We tackle the problem of…
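As a rough illustration of the experience replay idea named in the summary, the sketch below keeps a small fixed-size memory of past examples and mixes some of them into each new training batch. It is a minimal sketch under stated assumptions, not the authors' implementation: the class `ReplayBuffer`, the helper `mixed_batch`, and the use of reservoir sampling to decide what stays in memory are all hypothetical choices made for this example, and the examples are stand-ins for (image embedding, question embedding, answer) triples.

```python
import random
from typing import Any, List, Optional, Sequence


class ReplayBuffer:
    """Fixed-capacity memory of past examples.

    Assumption: reservoir sampling is used so that every example seen so far
    has an equal chance of being in memory. The source does not say how its
    replay memory is filled.
    """

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0  # total number of examples offered to the buffer
        self.rng = random.Random(seed)

    def add(self, example: Any) -> None:
        """Offer one example; it may replace a stored one at random."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
            return
        # Keep the new example with probability capacity / seen.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = example

    def sample(self, k: int) -> List[Any]:
        """Draw up to k stored examples without replacement."""
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)


def mixed_batch(new_examples: Sequence[Any], buffer: ReplayBuffer,
                replay_size: int) -> List[Any]:
    """Build a batch of current-task examples plus replayed old ones,
    then store the current examples for future replay."""
    batch = list(new_examples) + buffer.sample(replay_size)
    for example in new_examples:
        buffer.add(example)
    return batch


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=4, seed=0)
    # Hypothetical placeholder examples from an earlier task.
    for i in range(10):
        buffer.add(("task_a", i))
    # A batch for a later task, padded with two replayed examples.
    print(mixed_batch([("task_b", 0), ("task_b", 1)], buffer, replay_size=2))
```

Training on such mixed batches exposes the model to earlier tasks while it learns a new one, which is the mechanism by which replay is generally understood to limit catastrophic forgetting.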
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training · Experience Replay
