Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

TL;DR
This paper presents a hybrid approach that combines fast dual encoders with slower, more accurate cross-attention transformers for text-to-visual retrieval, aiming for high retrieval speed and competitive accuracy on large-scale datasets.
Contribution
It introduces a new fine-grained cross-attention architecture for transformers, and a generic distillation and re-ranking method to combine fast and slow models for large-scale retrieval.
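The re-ranking idea can be illustrated with a minimal two-stage sketch. It is not the authors' implementation: the function names (`fast_shortlist`, `rerank`, `retrieve`), the shortlist size `k`, and the `slow_score` callable are all hypothetical, and exact dot-product search stands in for the approximate nearest neighbour search that a large-scale system would likely use. The sketch assumes gallery embeddings are precomputed by the fast dual encoder and that the slow cross-attention model is exposed as a function scoring one (query, item) pair at a time.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fast_shortlist(query_emb, gallery_embs, k):
    """Stage 1 (fast): score every gallery item with one dot product, keep the top k.

    Assumption: exact search is used here for simplicity.
    """
    scores = gallery_embs @ query_emb
    k = min(k, scores.shape[0])
    # argpartition finds the k best candidates without fully sorting the gallery.
    top = np.argpartition(-scores, k - 1)[:k]
    # Sort only the k survivors by descending fast score.
    return top[np.argsort(-scores[top])]


def rerank(query, shortlist, slow_score):
    """Stage 2 (slow): run the expensive scorer on the shortlist only.

    `slow_score(query, item_index)` is a hypothetical stand-in for a
    cross-attention model that must see the query and the item together.
    """
    slow_scores = np.array([slow_score(query, int(i)) for i in shortlist])
    order = np.argsort(-slow_scores, kind="stable")
    return shortlist[order]


def retrieve(query, query_emb, gallery_embs, slow_score, k=10):
    """Full pipeline: the slow model is called at most k times per query."""
    shortlist = fast_shortlist(query_emb, gallery_embs, k)
    return rerank(query, shortlist, slow_score)
```

With a gallery of N items, this design calls the slow scorer at most `k` times per query instead of N times, which is where the speedup would come from. The final ranking can only contain items that the fast model placed in its top `k`, so the quality of the fast model bounds the recall of the whole pipeline.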
Findings
Significant inference speedup on the Flickr30K dataset
Accuracy competitive with state-of-the-art methods
Improved retrieval performance on the VATEX video dataset
Abstract
Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving…
