Revising Image-Text Retrieval via Multi-Modal Entailment
Xu Yan, Chunhui Ai, Ziqiang Cao, Min Cao, Sujian Li, Wenjie Li, Guohong Fu

TL;DR
This paper introduces a multi-modal entailment classifier to improve image-text retrieval datasets by identifying captions that are truly entailed by images, leading to better training and evaluation accuracy.
Contribution
The paper proposes a novel multi-modal entailment classifier and dataset revision method to address many-to-many caption-image matching issues in retrieval datasets.
Findings
Entailment classifier achieves about 78% accuracy.
Revised datasets improve retrieval model performance.
Enhanced dataset quality reduces confusion during training.
Abstract
An outstanding image-text retrieval model depends on high-quality labeled data. While the builders of existing image-text retrieval datasets strive to ensure that the caption matches the linked image, they cannot prevent a caption from fitting other images. We observe that such a many-to-many matching phenomenon is quite common in the widely-used retrieval datasets, where one caption can describe up to 178 images. These large matching-lost data not only confuse the model in training but also weaken the evaluation accuracy. Inspired by visual and textual entailment tasks, we propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions. Subsequently, we revise the image-text retrieval datasets by adding these entailed captions as additional weak labels of an image and develop a universal variable learning rate strategy to…
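As an illustration of the revision step described in the abstract, the following Python sketch adds entailed captions as weak labels for each image. It is not the authors' implementation: `add_weak_labels`, `toy_overlap_scorer`, the 0.9 threshold, and the sample captions are all hypothetical, and the scorer is a text-only word-overlap stand-in for the paper's multi-modal entailment classifier, which also takes the image as input.

```python
from typing import Callable, Dict, List

# Hypothetical scorer signature: (image_id, linked_captions, candidate) -> score in [0, 1]
Scorer = Callable[[str, List[str], str], float]


def toy_overlap_scorer(image_id: str, linked: List[str], candidate: str) -> float:
    """Stand-in for an entailment classifier (assumption, text-only).

    Returns the fraction of the candidate's distinct words that appear in
    the image's linked captions. The image itself is ignored here.
    """
    linked_words = {w for cap in linked for w in cap.lower().split()}
    cand_words = set(candidate.lower().split())
    if not cand_words:
        return 0.0
    return len(cand_words & linked_words) / len(cand_words)


def add_weak_labels(
    gold: Dict[str, List[str]],
    candidates: List[str],
    entails: Scorer,
    threshold: float = 0.9,
) -> Dict[str, List[str]]:
    """For each image, collect non-linked captions the scorer judges entailed."""
    weak: Dict[str, List[str]] = {}
    for image_id, linked in gold.items():
        weak[image_id] = [
            c for c in candidates
            if c not in linked and entails(image_id, linked, c) >= threshold
        ]
    return weak


# Toy data (made up): image id -> captions originally linked to that image.
gold = {
    "img1": ["a dog runs on the grass"],
    "img2": ["a brown dog runs on the green grass"],
    "img3": ["two children play chess"],
}
candidates = [c for caps in gold.values() for c in caps]
weak = add_weak_labels(gold, candidates, toy_overlap_scorer)
print(weak)
```

With this toy data, only `img2` gains a weak label ("a dog runs on the grass"): the shorter caption is fully covered by the longer one, but not the reverse. That asymmetry is the reason to frame the check as entailment rather than symmetric similarity.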
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
