Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

Jiacheng Cheng; Hijung Valentina Shin; Nuno Vasconcelos; Bryan Russell; Fabian Caba Heilbron

arXiv:2405.03190·cs.CV·May 7, 2024

Open Access

TL;DR

This paper addresses the challenge of inconsistent retrieval results for paraphrased queries in vision-language models, proposing a dataset and training strategies to improve semantic consistency in paraphrased text-to-image retrieval.

Contribution

It introduces a new dataset for paraphrased image descriptions and develops training strategies to enhance dual-encoder models' semantic understanding of paraphrases.

Findings

1. Improved retrieval consistency for paraphrased queries.
2. Maintained zero-shot classification and retrieval accuracy.
3. Significantly higher ranking similarity for paraphrased queries.

Abstract

In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate…
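
To make the task concrete, the sketch below shows one way the agreement between the retrievals for two paraphrased queries could be scored. It is an illustration only: the top-k Jaccard overlap used here is an assumed stand-in, not necessarily the ranking-similarity measure used in the paper, and the function names (`cosine`, `top_k`, `top_k_overlap`) and the toy 2-D embeddings are hypothetical. In practice the embeddings would come from the text and image towers of a dual-encoder model such as CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k(query_emb, image_embs, k):
    """Indices of the k images most similar to the query, best first."""
    scores = [cosine(query_emb, img) for img in image_embs]
    order = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_overlap(query_a, query_b, image_embs, k):
    """Jaccard overlap of the top-k retrievals for two queries.

    1.0 means both queries retrieve the same k images; 0.0 means the
    two result sets are disjoint. (Assumed metric, for illustration.)
    """
    a = set(top_k(query_a, image_embs, k))
    b = set(top_k(query_b, image_embs, k))
    return len(a & b) / len(a | b)


# Toy 2-D embeddings standing in for image-tower and text-tower outputs.
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.05]
close_paraphrase = [0.95, 0.1]    # text tower maps the paraphrase nearby
drifted_paraphrase = [0.05, 1.0]  # text tower maps the paraphrase far away

consistent = top_k_overlap(query, close_paraphrase, images, k=2)
inconsistent = top_k_overlap(query, drifted_paraphrase, images, k=2)
```

Under this toy setup, a text tower that embeds a paraphrase close to the original query yields a high overlap, while one that lets the paraphrase drift yields a low overlap, which is the inconsistency the abstract describes.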

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training