Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation

Yiming Xu; Lin Chen; Zhongwei Cheng; Lixin Duan; Jiebo Luo

arXiv:1911.04058 · cs.CV · November 12, 2019

TL;DR

This paper introduces a supervised multi-modal domain adaptation approach for visual question answering (VQA) that transfers knowledge from a label-rich source domain to a target domain with only limited labeled data, across multiple modalities (images, questions, and answers).

Contribution

It proposes a novel method to align data distributions across domains and modalities, improving VQA performance in realistic open-ended scenarios.
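Aligning data distributions across domains and modalities can be illustrated with a generic discrepancy measure. The sketch below uses a Gaussian-kernel maximum mean discrepancy (MMD), computed once per modality and once on the concatenated (joint) embedding. This is an assumption for illustration only: this page does not say which alignment criterion the paper uses, and the names `mmd2`, `joint`, and `multimodal_alignment`, the kernel width `sigma`, and the toy embeddings are all hypothetical. Each modality is assumed to be already encoded as a fixed-length vector.

```python
import math
from typing import Dict, List

Vector = List[float]
MODALITIES = ("image", "question", "answer")  # hypothetical modality keys


def gaussian_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs: List[Vector], ys: List[Vector], sigma: float = 1.0) -> float:
    """Biased estimate of the squared MMD between two samples."""
    def mean_k(a: List[Vector], b: List[Vector]) -> float:
        total = sum(gaussian_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def joint(batch: Dict[str, List[Vector]]) -> List[Vector]:
    """Concatenate image, question, and answer vectors example by example."""
    n = len(batch["image"])
    return [sum((batch[m][i] for m in MODALITIES), []) for i in range(n)]


def multimodal_alignment(src: Dict[str, List[Vector]],
                         tgt: Dict[str, List[Vector]],
                         sigma: float = 1.0) -> float:
    """Sum of per-modality discrepancies plus one on the joint embedding."""
    per_modality = sum(mmd2(src[m], tgt[m], sigma) for m in MODALITIES)
    return per_modality + mmd2(joint(src), joint(tgt), sigma)


# Toy embeddings: the target differs from the source only in the image modality.
source = {"image": [[0.0, 0.0], [0.1, 0.0]],
          "question": [[1.0], [1.1]],
          "answer": [[0.0], [0.2]]}
target = {"image": [[2.0, 2.0], [2.1, 2.0]],
          "question": [[1.0], [1.1]],
          "answer": [[0.0], [0.2]]}

print(multimodal_alignment(source, source))        # 0.0: identical distributions
print(multimodal_alignment(source, target) > 0.0)  # True: the image shift is detected
```

Checking each modality separately catches a shift confined to one input (here, the images), while the joint term is sensitive to changes in how the modalities co-occur; a differentiable version of such a term would be added to the training loss.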

Findings

1. Outperforms state-of-the-art methods on the VQA 2.0 and VizWiz datasets
2. Effectively models transferability across images, questions, and answers
3. Improves VQA accuracy when only limited labeled target-domain data are available

Abstract

We study the problem of visual question answering (VQA) in images by exploiting supervised domain adaptation, where there is a large amount of labeled data in the source domain but only limited labeled data in the target domain, with the goal of training a good target model. A straightforward solution is to fine-tune a pre-trained source model using those limited labeled target data, but this usually does not work well because of the considerable difference between the data distributions of the source and target domains. Moreover, the availability of multiple modalities (i.e., images, questions and answers) in VQA poses a further challenge: modeling the transferability between those different modalities. In this paper, we tackle the above issues by proposing a novel supervised multi-modal domain adaptation method for VQA to learn joint feature embeddings across different domains and modalities.…
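The contrast between plain fine-tuning and joint supervised adaptation can be written as two training objectives. This is a minimal sketch, not the authors' formulation: it assumes open-ended answers are predicted as a classification over a fixed answer vocabulary, and it assumes a simple linear-kernel MMD (squared distance between the domain means of the joint embeddings) as the alignment term. `adaptation_objective`, `finetune_objective`, `lam`, and every other name here are hypothetical.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one example, using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def mean_ce(batch_logits: List[Sequence[float]], labels: List[int]) -> float:
    """Average cross-entropy over a batch of answer-vocabulary logits."""
    total = sum(cross_entropy(logits, y) for logits, y in zip(batch_logits, labels))
    return total / len(labels)


def mean_vector(vs: List[List[float]]) -> List[float]:
    """Per-dimension mean of a list of equal-length vectors."""
    return [sum(col) / len(vs) for col in zip(*vs)]


def linear_mmd2(src_emb: List[List[float]], tgt_emb: List[List[float]]) -> float:
    """Linear-kernel MMD: squared distance between the two domain means."""
    ms, mt = mean_vector(src_emb), mean_vector(tgt_emb)
    return sum((a - b) ** 2 for a, b in zip(ms, mt))


def finetune_objective(tgt_logits, tgt_labels) -> float:
    """Baseline: after source pre-training, fit the few target labels only."""
    return mean_ce(tgt_logits, tgt_labels)


def adaptation_objective(src_logits, src_labels, tgt_logits, tgt_labels,
                         src_emb, tgt_emb, lam: float = 0.1) -> float:
    """Hypothetical objective: supervised loss on both domains + weighted alignment."""
    supervised = mean_ce(src_logits, src_labels) + mean_ce(tgt_logits, tgt_labels)
    return supervised + lam * linear_mmd2(src_emb, tgt_emb)


# Two-answer vocabulary, one example per domain, 1-D joint embeddings.
loss = adaptation_objective([[0.0, 0.0]], [0], [[0.0, 0.0]], [1],
                            [[0.0]], [[1.0]], lam=0.5)
print(round(loss, 4))  # 2*ln(2) + 0.5*1.0, i.e. 1.8863
```

Fine-tuning optimizes only the target term, so with few target labels nothing prevents the model from drifting away from what it learned on the source. The joint objective keeps the source supervision and adds a penalty, weighted by `lam`, that pulls the two domains' embeddings together.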

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques