Leveraging Visual Question Answering for Image-Caption Ranking
Xiao Lin, Devi Parikh

TL;DR
This paper uses Visual Question Answering (VQA) as a feature extractor for image-caption ranking. Reasoning about whether an image and a caption agree on a shared set of question-answer facts significantly improves retrieval accuracy.
Contribution
The work proposes integrating VQA-based features into image-caption ranking models through score-level and representation-level fusion, advancing state-of-the-art performance.
Findings
Caption retrieval improved by 7.1% over the prior state of the art
Image retrieval improved by 4.4% over the prior state of the art
Effective use of VQA for cross-modal reasoning
Abstract
Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a "feature extraction" module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art…
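The two fusion strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's models are learned, whereas the sketch below uses fixed cosine similarity, simple concatenation, and a hand-picked mixing weight `alpha`. All function names, the toy feature values, and the three imagined question-answer facts are hypothetical. Each `vqa` vector holds one plausibility value per fact, for the image or for the caption.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def score_level_fusion(img, cap, alpha=0.5):
    """Score each representation separately, then blend the two scores.
    alpha is an assumed mixing weight, not a value from the paper."""
    base_score = cosine(img["base"], cap["base"])  # VQA-agnostic score
    vqa_score = cosine(img["vqa"], cap["vqa"])     # fact-consistency score
    return (1 - alpha) * base_score + alpha * vqa_score

def representation_level_fusion(img, cap):
    """Concatenate baseline and VQA features per modality, then score once."""
    return cosine(img["base"] + img["vqa"], cap["base"] + cap["vqa"])

def rank_captions(img, caps, score_fn):
    """Return caption indices ordered from best to worst match."""
    scores = [score_fn(img, c) for c in caps]
    return sorted(range(len(caps)), key=lambda i: scores[i], reverse=True)

# Toy data (hypothetical). The "vqa" dimensions stand for the plausibility of
# three imagined facts, e.g. ("Is there a dog?", "yes").
image = {"base": [1.0, 0.0], "vqa": [0.9, 0.1, 0.8]}
captions = [
    {"base": [0.6, 0.8], "vqa": [0.1, 0.9, 0.2]},  # disagrees with the image on the facts
    {"base": [0.6, 0.8], "vqa": [0.8, 0.2, 0.9]},  # agrees with the image on the facts
]

score_ranking = rank_captions(image, captions, score_level_fusion)
repr_ranking = rank_captions(image, captions, representation_level_fusion)
```

In this toy setup both captions receive the same baseline score, so only the VQA dimensions can separate them. Under either fusion, the caption whose fact plausibilities agree with the image ranks first.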
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
