TL;DR
This paper introduces SelfAlign, a self-supervised alignment module that enhances image-text retrieval accuracy in independent-embedding models without increasing retrieval time or requiring extra supervision.
Contribution
SelfAlign improves retrieval accuracy by aligning image and text at concept and context levels using contrastive learning, without cross-modal interactions during training.
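The summary above does not give SelfAlign's exact loss, so the following is only a minimal sketch of a generic symmetric contrastive (InfoNCE-style) objective over a batch of matched image-text pairs. The function names, the temperature value of 0.07, and the use of one embedding vector per image and per caption are all assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss; pair i of images and texts is the positive.

    Hypothetical helper for illustration only, not the paper's implementation.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Mean of -log softmax(row)[i], using a numerically stable log-sum-exp
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(text_to_image))
```

Minimizing a loss of this kind pulls each matched pair together and pushes mismatched pairs apart using only a similarity matrix between independently computed embeddings, so no cross-attention between the two encoders is involved.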
Findings
Boosts state-of-the-art models' accuracy by up to 9.1% on Flickr30K.
Outperforms many interactive-embedding models in accuracy with less retrieval time.
Maintains efficiency with comparable time cost to existing models.
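The efficiency findings above follow from how independent-embedding retrieval works in general: gallery embeddings can be computed once offline, and each query then needs only one encoder pass plus a similarity ranking. The sketch below illustrates that pattern with plain Python lists; `build_index` and `retrieve` are hypothetical names, and cosine similarity is assumed as the scoring function.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def build_index(image_embs):
    """Offline step: normalize and store every gallery image embedding once."""
    return [normalize(v) for v in image_embs]


def retrieve(query_emb, index, k=1):
    """Online step: rank gallery items by cosine similarity to one text query."""
    q = normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, v)) for v in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Usage: three precomputed image embeddings, one text query embedding
index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
top_two = retrieve([0.9, 0.1], index, k=2)
```

Because a training-time alignment module changes only how the encoders are trained, the online step in a setup like this is unaffected, which is consistent with the comparable retrieval time reported above.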
Abstract
Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly either incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pretrained encoders, both of which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that force image-text alignment at both the concept level and the context level by self-supervised contrastive learning. It does not require…
