Learning Answer Embeddings for Visual Question Answering
Hexiang Hu, Wei-Lun Chao, Fei Sha

TL;DR
This paper introduces a probabilistic embedding-based approach for Visual Question Answering that models semantic relationships among answers, enabling better transfer learning and handling unseen answers.
Contribution
It presents a novel embedding-based probabilistic model for Visual QA that considers answer semantics and supports transfer learning to unseen answers.
Findings
Performs well on in-domain Visual QA datasets.
Transfers effectively to new datasets with limited answer overlap.
Handles large answer spaces with scalable optimization techniques.
Abstract
We propose a novel probabilistic model for visual question answering (Visual QA). The key idea is to infer two sets of embeddings: one for the image and the question jointly and the other for the answers. The learning objective is to learn the best parameterization of those embeddings such that the correct answer has higher likelihood among all possible answers. In contrast to several existing approaches of treating Visual QA as multi-way classification, the proposed approach takes the semantic relationships (as characterized by the embeddings) among answers into consideration, instead of viewing them as independent ordinal numbers. Thus, the learned embedding function can be used to embed answers that are unseen in the training dataset. These properties make the approach particularly appealing for transfer learning for open-ended Visual QA, where the source dataset on which the model is…
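The two-embedding idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear maps, feature sizes, and inner-product scoring below are assumptions chosen only to make the example runnable, and all function names are hypothetical. It shows one joint embedding for the image and question, one embedding for answers, a softmax likelihood over candidate answers, and how a new answer can be scored without changing the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
DIM_JOINT_IN = 16  # fused image+question feature
DIM_ANS_IN = 8     # answer feature (e.g. averaged word vectors)
DIM_EMB = 4        # shared embedding space

# Stand-ins for the two learned embedding functions (plain linear maps here).
W_joint = rng.normal(size=(DIM_JOINT_IN, DIM_EMB))
W_answer = rng.normal(size=(DIM_ANS_IN, DIM_EMB))


def embed_image_question(iq_features):
    """Map a fused image+question feature vector into the shared space."""
    return iq_features @ W_joint


def embed_answers(answer_features):
    """Map a matrix of answer features (one row per answer) into the shared space."""
    return answer_features @ W_answer


def answer_log_probs(iq_features, answer_features):
    """Log-softmax over candidate answers of the embedding inner products.

    Inner-product scoring is an assumption of this sketch.
    """
    scores = embed_answers(answer_features) @ embed_image_question(iq_features)
    scores = scores - scores.max()  # shift for numerical stability
    return scores - np.log(np.exp(scores).sum())


def nll(iq_features, answer_features, correct_index):
    """Negative log-likelihood of the correct answer among the candidates."""
    return -answer_log_probs(iq_features, answer_features)[correct_index]


# One image+question pair and five candidate answers (random placeholders).
iq = rng.normal(size=DIM_JOINT_IN)
candidates = rng.normal(size=(5, DIM_ANS_IN))
log_p = answer_log_probs(iq, candidates)
loss = nll(iq, candidates, correct_index=2)

# An answer absent from training is embedded by the same function and scored
# alongside the others; no new output unit is needed.
unseen = rng.normal(size=(1, DIM_ANS_IN))
log_p_extended = answer_log_probs(iq, np.vstack([candidates, unseen]))
```

In a multi-way classifier each answer is a fixed output index, so adding an answer means adding and training a new output unit. In the sketch above, the candidate set is just the set of rows passed to the answer embedding, which is what makes scoring previously unseen answers possible.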
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
