Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Shir Gur; Natalia Neverova; Chris Stauffer; Ser-Nam Lim; Douwe Kiela; Austin Reiter

arXiv:2104.08108·cs.CV·April 19, 2021

Open Access

TL;DR

This paper trains an alignment model that embeds images and captions in a shared space, then uses it to retrieve from an external image-caption knowledge source and augment multi-modal transformers. The retrieval-augmented models improve visual question answering (VQA) over strong baselines and allow the retrieval index to be hot-swapped at inference time.

Contribution

The paper presents a new alignment model for images and captions and demonstrates how retrieval-augmented transformers improve VQA results over existing baselines.

Findings

01 Improved image-caption retrieval performance

02 Enhanced VQA accuracy with retrieval-augmented models

03 Effective inference-time hot-swapping of indices

Abstract

Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement in performance on image-caption retrieval w.r.t. similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications for inference time such as hot-swapping indices.
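
The retrieval and hot-swapping ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the alignment model has already produced embeddings in a shared image-caption space, uses cosine similarity for nearest-neighbour lookup, and appends retrieved captions to the question as plain text. The names `CaptionIndex`, `RetrievalAugmenter`, `swap_index`, and the `[SEP]` joining scheme are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


@dataclass
class CaptionIndex:
    """Hypothetical external knowledge source: captions and their embeddings.

    The embeddings are assumed to come from an alignment model that maps
    images and captions into the same space (not implemented here).
    """
    captions: List[str]
    embeddings: List[List[float]]

    def __post_init__(self) -> None:
        if len(self.captions) != len(self.embeddings):
            raise ValueError("captions and embeddings must have equal length")
        self.embeddings = [normalize(e) for e in self.embeddings]

    def top_k(self, query: Sequence[float], k: int) -> List[Tuple[str, float]]:
        """Return the k captions most similar to the query embedding."""
        q = normalize(query)
        scored = [
            (caption, sum(a * b for a, b in zip(q, emb)))
            for caption, emb in zip(self.captions, self.embeddings)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


class RetrievalAugmenter:
    """Adds retrieved captions to a question before it reaches a VQA model."""

    def __init__(self, index: CaptionIndex) -> None:
        self.index = index

    def swap_index(self, new_index: CaptionIndex) -> None:
        """Hot-swap the knowledge source at inference time; no retraining."""
        self.index = new_index

    def augment(self, question: str, image_embedding: Sequence[float], k: int = 1) -> str:
        """Join the question with the top-k retrieved captions (assumed format)."""
        retrieved = [caption for caption, _ in self.index.top_k(image_embedding, k)]
        return " [SEP] ".join([question] + retrieved)
```

Under these assumptions, swapping the index only replaces the object the augmenter queries, so the same image embedding can pull captions from a different knowledge source without touching the downstream model.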

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques