LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval

Jie Lei; Xinlei Chen; Ning Zhang; Mengjiao Wang; Mohit Bansal; Tamara L. Berg; Licheng Yu

arXiv:2203.05465 · cs.CV · March 11, 2022 · 6 cites

TL;DR

LoopITR combines dual and cross encoder architectures in a single network for image-text retrieval. The dual encoder supplies hard negatives to the cross encoder, and the cross encoder distills its predictions back to the dual encoder, yielding state-of-the-art dual encoder performance on Flickr30K and COCO.

Contribution

This work introduces a joint architecture that trains a dual encoder and a cross encoder in the same network, using hard negatives mined by the dual encoder and distillation from the cross encoder to improve image-text retrieval.

Findings

01

Achieves state-of-the-art dual encoder performance on Flickr30K and COCO datasets.

02

Demonstrates the effectiveness of distillation with few hard negatives.

03

Shows benefits of joint training of dual and cross encoders.

Abstract

Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input and performs dense multi-modal fusion. These two architectures are typically modeled separately without interaction. In this work, we propose LoopITR, which combines them in the same network for joint learning. Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder. Both steps are efficiently performed together in the same model. Our work centers on empirical analyses of this combined architecture, putting the main focus on the design of the distillation objective. Our experimental results highlight the benefits of…
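The loop described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are random, `toy_cross_scores` is a hypothetical stand-in for a real cross encoder, and the KL-divergence distillation loss with a `temperature` parameter is one common choice that the abstract does not confirm. All function names are invented for this example.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def dual_scores(img_emb, txt_emb):
    """Dual encoder: independent embeddings compared by dot product."""
    return img_emb @ txt_emb.T  # shape (n_images, n_texts)

def mine_hard_negatives(scores, k):
    """For each image i, pick the k highest-scoring non-matching texts."""
    masked = scores.copy()
    np.fill_diagonal(masked, -np.inf)  # text i is the positive for image i
    return np.argsort(-masked, axis=1)[:, :k]

def toy_cross_scores(img_emb, txt_emb, cand):
    """Hypothetical cross encoder stand-in: scores each (image, text) pair
    jointly. A real cross encoder would fuse both inputs in a transformer."""
    pairs = img_emb[:, None, :] * txt_emb[cand]  # shape (n, 1 + k, d)
    return np.tanh(pairs).sum(axis=-1)

def distill_loss(dual_cand, cross_cand, temperature=1.0):
    """Assumed objective: KL(teacher || student) over each candidate list,
    with the cross encoder as teacher and the dual encoder as student."""
    log_p_teacher = log_softmax(cross_cand / temperature)
    log_p_student = log_softmax(dual_cand / temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=1)
    return float(kl.mean())

rng = np.random.default_rng(0)
n, d, k = 6, 8, 2  # toy sizes: 6 matched pairs, 8-dim embeddings, 2 negatives
img_emb = rng.normal(size=(n, d))
txt_emb = rng.normal(size=(n, d))

scores = dual_scores(img_emb, txt_emb)
hard = mine_hard_negatives(scores, k)                          # dual -> cross
cand = np.concatenate([np.arange(n)[:, None], hard], axis=1)   # positive first
dual_cand = np.take_along_axis(scores, cand, axis=1)
cross_cand = toy_cross_scores(img_emb, txt_emb, cand)
loss = distill_loss(dual_cand, cross_cand)                     # cross -> dual
```

In this sketch the expensive pairwise scorer only ever sees `1 + k` candidates per image, which is the practical reason for letting the cheap dot-product scores choose the negatives first.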

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications