Revising Image-Text Retrieval via Multi-Modal Entailment

Xu Yan; Chunhui Ai; Ziqiang Cao; Min Cao; Sujian Li; Wenjie Li; Guohong Fu

arXiv:2208.10126·cs.CV·September 2, 2022·1 cites

Open Access

TL;DR

This paper introduces a multi-modal entailment classifier to improve image-text retrieval datasets by identifying captions that are truly entailed by images, leading to better training and evaluation accuracy.

Contribution

The paper proposes a novel multi-modal entailment classifier and dataset revision method to address many-to-many caption-image matching issues in retrieval datasets.

Findings

01 Entailment classifier achieves about 78% accuracy.

02 Revised datasets improve retrieval model performance.

03 Enhanced dataset quality reduces confusion during training.

Abstract

An outstanding image-text retrieval model depends on high-quality labeled data. While the builders of existing image-text retrieval datasets strive to ensure that each caption matches its linked image, they cannot prevent a caption from also fitting other images. We observe that this many-to-many matching phenomenon is quite common in widely used retrieval datasets, where one caption can describe up to 178 images. The large number of matches missing from the labels not only confuses the model during training but also weakens evaluation accuracy. Inspired by visual and textual entailment tasks, we propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions. Subsequently, we revise the image-text retrieval datasets by adding these entailed captions as additional weak labels of an image and develop a universal variable learning rate strategy to…
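The revision step described in the abstract can be illustrated with a minimal sketch. It assumes an entailment decision is available as a callable; the function `revise_labels`, its parameters, and the word-overlap stand-in `toy_entails` are hypothetical names introduced here for illustration and are not the paper's classifier or code. The sketch only shows the bookkeeping: for each image, every caption not already linked to it is tested against the image plus its linked captions, and those judged entailed are recorded as additional weak labels.

```python
from typing import Callable, Dict, List, Set

# Hypothetical signature: (image_id, linked_captions, candidate) -> entailed?
EntailFn = Callable[[str, List[str], str], bool]


def revise_labels(
    image_captions: Dict[str, List[str]],
    candidate_captions: List[str],
    entails: EntailFn,
) -> Dict[str, Set[str]]:
    """Collect, per image, the non-linked captions judged entailed (weak labels)."""
    weak_labels: Dict[str, Set[str]] = {}
    for image_id, linked in image_captions.items():
        linked_set = set(linked)
        weak_labels[image_id] = {
            caption
            for caption in candidate_captions
            if caption not in linked_set and entails(image_id, linked, caption)
        }
    return weak_labels


def toy_entails(image_id: str, linked: List[str], candidate: str) -> bool:
    """Toy stand-in for a trained classifier (an assumption, not the paper's model):
    a candidate counts as entailed if all of its words occur in the linked captions.
    The image itself is ignored here."""
    linked_words = {word for caption in linked for word in caption.split()}
    return set(candidate.split()) <= linked_words


image_captions = {
    "img1": ["a dog runs on the grass"],
    "img2": ["a cat sleeps on a sofa"],
}
candidates = ["a dog runs", "a cat sleeps", "a dog runs on the grass"]

weak = revise_labels(image_captions, candidates, toy_entails)
```

In this toy run, `weak["img1"]` contains only "a dog runs" and `weak["img2"]` contains only "a cat sleeps"; the caption already linked to `img1` is skipped because it is a strong label, not a weak one. In practice the stand-in would be replaced by a trained multi-modal model that also looks at the image.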

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning