Leveraging Visual Question Answering for Image-Caption Ranking

Xiao Lin; Devi Parikh

arXiv:1605.01379·cs.CV·September 2, 2016·1 cites

TL;DR

This paper introduces a novel approach that leverages Visual Question Answering (VQA) as a feature extraction tool to enhance image-caption ranking, significantly improving retrieval accuracy by reasoning about image-caption consistency.

Contribution

The work proposes integrating VQA-based features into image-caption ranking models through score-level and representation-level fusion, advancing state-of-the-art performance.

Findings

01. Improves the state of the art on caption retrieval by 7.1%

02. Improves the state of the art on image retrieval by 4.4%

03. Shows that VQA knowledge is effective for reasoning about image-caption consistency

Abstract

Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a "feature extraction" module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art…
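
The two fusion strategies named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the cosine scoring, the mixing weight `alpha`, the function names, and all numeric values below are assumptions made for illustration. Each VQA feature vector stands in for the plausibility of a few question-answer pairs given an image or a caption, and the "base" embeddings stand in for a VQA-agnostic ranking model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def score_level_fusion(base_score, vqa_score, alpha=0.5):
    """Blend two separately computed scores (alpha is an assumed weight)."""
    return (1.0 - alpha) * base_score + alpha * vqa_score

def representation_level_fusion(base_img, vqa_img, base_cap, vqa_cap):
    """Concatenate base and VQA features per modality, then score once."""
    img = np.concatenate([base_img, vqa_img])
    cap = np.concatenate([base_cap, vqa_cap])
    return cosine(img, cap)

def rank(scores):
    """Indices of candidates sorted from best to worst score."""
    return [int(i) for i in np.argsort(-np.asarray(scores))]

# Hypothetical VQA-agnostic embeddings for one image and two captions.
base_img = np.array([1.0, 0.0])
base_caps = [np.array([0.6, 0.8]), np.array([0.8, 0.6])]

# Hypothetical VQA features: plausibility of three question-answer pairs.
vqa_img = np.array([0.9, 0.1, 0.8])
vqa_caps = [np.array([0.8, 0.2, 0.9]), np.array([0.1, 0.9, 0.2])]

base_scores = [cosine(base_img, c) for c in base_caps]
vqa_scores = [cosine(vqa_img, c) for c in vqa_caps]

score_fused = [score_level_fusion(b, v) for b, v in zip(base_scores, vqa_scores)]
repr_fused = [
    representation_level_fusion(base_img, vqa_img, bc, vc)
    for bc, vc in zip(base_caps, vqa_caps)
]

base_ranking = rank(base_scores)
score_ranking = rank(score_fused)
repr_ranking = rank(repr_fused)
```

In this toy setup the base scores alone prefer the second caption, while both fused rankings prefer the first, whose question-answer plausibilities agree with the image. A learned model would replace the fixed `alpha` and the plain concatenation with trained parameters.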

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning