LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara L. Berg, Licheng Yu

TL;DR
LoopITR combines dual and cross encoder architectures within a single network for image-text retrieval. The dual encoder supplies hard negatives to the cross encoder, and the cross encoder distills its predictions back to the dual encoder, yielding state-of-the-art dual encoder results on Flickr30K and COCO.
Contribution
This work introduces a novel joint architecture that integrates dual and cross encoders with mutual distillation for improved image-text retrieval.
Findings
Achieves state-of-the-art dual encoder performance on Flickr30K and COCO datasets.
Demonstrates the effectiveness of distillation with few hard negatives.
Shows benefits of joint training of dual and cross encoders.
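The hard-negative step mentioned in these findings can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the dual encoder has already produced embeddings, uses cosine similarity as the dot-product score, and picks the k highest-scoring non-matching texts per image. The function name `mine_hard_negatives`, the embedding sizes, and the value of `k` are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each embedding to unit length so the dot product is a cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mine_hard_negatives(image_emb, text_emb, k=2):
    """For each image i (whose matching text is text i), return the indices of
    the k non-matching texts with the highest dual-encoder similarity.
    Assumes a square batch of N pairs with N > k."""
    sim = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # shape (N, N)
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)           # exclude the positive pair
    hard_idx = np.argsort(-masked, axis=1)[:, :k]  # top-k most similar negatives
    return sim, hard_idx

# Toy batch: random stand-ins for dual-encoder outputs (assumption, not real features).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(6, 16))
text_emb = image_emb + 0.5 * rng.normal(size=(6, 16))
sim, hard_idx = mine_hard_negatives(image_emb, text_emb, k=2)
```

Each row of `hard_idx` lists the texts the dual encoder finds most confusable with that image; in a setup like the one described, only these few pairs would need to be scored by the more expensive cross encoder.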
Abstract
Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input and performs dense multi-modal fusion. These two architectures are typically modeled separately without interaction. In this work, we propose LoopITR, which combines them in the same network for joint learning. Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder. Both steps are efficiently performed together in the same model. Our work centers on empirical analyses of this combined architecture, putting the main focus on the design of the distillation objective. Our experimental results highlight the benefits of…
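One way to picture the distillation step described in the abstract is sketched below. The abstract does not give the exact objective, so this assumes a common choice: a KL divergence between softmax distributions over one positive and a few hard negatives, with the cross encoder as teacher and the dual encoder as student. The function name `distillation_loss`, the `temperature` parameter, and the toy scores are hypothetical.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_loss(dual_scores, cross_scores, temperature=1.0):
    """KL(teacher || student), averaged over the batch.
    Both inputs have shape (N, 1 + k): column 0 is the positive pair,
    the remaining k columns are hard negatives.
    Assumed form; the paper's actual objective may differ."""
    log_p_teacher = log_softmax(cross_scores / temperature)  # cross encoder (teacher)
    log_p_student = log_softmax(dual_scores / temperature)   # dual encoder (student)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=1)
    return kl.mean()

# Toy scores for 2 queries, each with 1 positive and 2 hard negatives.
dual_scores = np.array([[0.9, 0.7, 0.2], [0.4, 0.5, 0.1]])
cross_scores = np.array([[4.0, 0.5, -1.0], [3.0, 1.0, -2.0]])
loss = distillation_loss(dual_scores, cross_scores)
same = distillation_loss(cross_scores, cross_scores)  # identical distributions
```

The loss is zero when the student matches the teacher and positive otherwise. In a training framework with automatic differentiation, the teacher scores would typically be detached so that this term updates only the dual encoder; that detail is also an assumption here.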
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
