Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

Antoine Miech; Jean-Baptiste Alayrac; Ivan Laptev; Josef Sivic; Andrew Zisserman

arXiv:2103.16553·cs.CV·March 31, 2021

TL;DR

This paper presents a hybrid approach combining fast dual encoders with transformer-based models for efficient and accurate text-to-visual retrieval, achieving high speed and competitive accuracy on large-scale datasets.

Contribution

It introduces a new fine-grained cross-attention architecture for transformers, and a generic distillation and re-ranking method to combine fast and slow models for large-scale retrieval.
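The re-ranking idea can be illustrated with a minimal sketch, assuming the fast model is a dual encoder scored by a dot product and the slow model is any more expensive scorer. The names `retrieve_then_rerank`, `slow_score`, and `k` are hypothetical and do not come from the paper, and the exhaustive scan in stage 1 stands in for the approximate nearest neighbour search mentioned in the abstract.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query_vec: Vector,
    gallery_vecs: List[Vector],
    slow_score: Callable[[int], float],
    k: int = 2,
) -> List[int]:
    """Return gallery indices, best first, after a two-stage search.

    Illustrative assumption, not the authors' implementation:
    stage 1 scores every item with the cheap dual-encoder similarity,
    stage 2 calls the expensive scorer only on the top-k shortlist.
    """
    # Stage 1 (fast): one dot product per gallery item.
    fast_scores = [dot(query_vec, v) for v in gallery_vecs]
    shortlist = sorted(
        range(len(gallery_vecs)),
        key=lambda i: fast_scores[i],
        reverse=True,
    )[:k]

    # Stage 2 (slow): re-order only the shortlisted items.
    return sorted(shortlist, key=slow_score, reverse=True)


# Toy data: four gallery embeddings and one query embedding.
gallery = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
query = [1.0, 0.2]

# Hypothetical scores from the slow model, indexed by gallery item.
toy_slow_scores = {0: 0.3, 1: 0.9, 2: 0.99, 3: 0.5}

ranking = retrieve_then_rerank(query, gallery, toy_slow_scores.__getitem__, k=2)
```

With `k=2` the slow scorer is called on two items instead of four, which is where the inference saving comes from. The trade-off is also visible in the toy data: item 2 has the highest slow score but is never considered, because the fast model did not shortlist it.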

Findings

1. Significant speedup in inference on the Flickr30K dataset

2. Competitive accuracy with state-of-the-art methods

3. Improved retrieval performance on the VATEX video dataset

Abstract

Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving…
