USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval
Yan Zhang, Zhong Ji, Di Wang, Yanwei Pang, Xuelong Li

TL;DR
This paper introduces USER, a unified semantic enhancement method with momentum contrast for image-text retrieval. It improves retrieval accuracy and inference efficiency by using global representations, transferring knowledge from CLIP, and maintaining dynamic queues of negative samples.
Contribution
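The paper proposes the USER framework, which combines semantic enhancement modules with momentum contrastive learning. It targets two limitations of existing methods: region-level features that treat every region equally, which hurts representation accuracy, and mini-batch based training, which limits the number of negative sample pairs.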
Findings
Achieves superior retrieval accuracy on MSCOCO and Flickr30K datasets.
Enhances inference efficiency compared to previous approaches.
Effectively enlarges negative sample sets using dynamic queues.
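The queue-based negative sampling mentioned above can be illustrated with a minimal sketch of the generic momentum contrast recipe: a momentum (moving-average) update for the key encoder, a fixed-size FIFO queue of past keys used as negatives, and an InfoNCE loss. This is not the authors' implementation. The function names, the toy 2-D embeddings, the queue size, and the values m=0.999 and temperature=0.07 are all assumptions chosen for illustration.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def momentum_update(key_params, query_params, m=0.999):
    """Moving-average update: key <- m * key + (1 - m) * query.

    m=0.999 is an illustrative value, not one taken from the paper.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query against one positive and queued negatives."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]  # equals -log softmax(logits)[0]


# A fixed-size FIFO queue: the oldest keys are dropped as new ones arrive,
# so the negative set can be much larger than a single mini-batch.
queue = deque(maxlen=4)  # hypothetical size, kept tiny for the example
for key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(key)

# Toy embeddings: a query (say, an image) paired with a matching or
# a non-matching key (say, a caption).
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], list(queue))
loss_misaligned = info_nce([1.0, 0.0], [-1.0, 0.0], list(queue))
```

In this toy setting the loss is lower when the query and its positive point in the same direction than when they point in opposite directions. For retrieval, the same loss would typically be applied in both directions (image-to-text and text-to-image), each with its own queue; that pairing is also an assumption of this sketch.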
Abstract
As a fundamental and challenging task in bridging language and vision domains, Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality, and its key challenge is to measure the semantic similarity across different modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) Directly exploiting bottom-up attention based region-level features, where each region is treated equally, hurts the accuracy of the representation. (2) Employing the mini-batch based end-to-end training mechanism limits the scale of negative sample pairs. To address these limitations, we propose a Unified Semantic Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically, we delicately design two simple but effective Global…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: InfoNCE · Batch Normalization · Contrastive Language-Image Pre-training · Contrastive Learning · Momentum Contrast
