EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

GuangHao Meng; Sunan He; Jinpeng Wang; Tao Dai; Letian Zhang; Jieming Zhu; Qing Li; Gang Wang; Rui Zhang; Yong Jiang

arXiv:2505.18594·cs.CV·May 27, 2025

TL;DR

EvdCLIP enhances vision-language retrieval by integrating entity visual descriptions generated by large language models into queries, and employs a trainable rewriter to improve retrieval accuracy.

Contribution

The paper introduces EVDs from LLMs and a trainable EVD-aware rewriter to improve retrieval performance, addressing noise issues in query expansion.

Findings

1. EvdCLIP outperforms existing methods on benchmark datasets.

2. EVD integration significantly improves retrieval accuracy.

3. The trainable rewriter effectively reduces noise in queries.

Abstract

Vision-language retrieval (VLR), which uses text (or images) as queries to retrieve corresponding images (or text), has attracted significant attention in both academia and industry. However, existing methods often neglect the rich visual semantic knowledge of entities, leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a…
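The query-enrichment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `enhance_query`, the table `EXAMPLE_EVDS`, its made-up descriptions, and the "query (entity: description)" output format are all assumptions chosen for illustration. In the paper the descriptions come from an LLM, whereas here they are hard-coded.

```python
import re
from typing import Dict

# Hypothetical entity -> visual description table. In the paper these
# descriptions are generated by an LLM; the entries below are made up.
EXAMPLE_EVDS: Dict[str, str] = {
    "corgi": "a short-legged dog with pointed ears",
    "lighthouse": "a tall tower with a light at the top",
}


def enhance_query(query: str, evds: Dict[str, str]) -> str:
    """Append the visual description of every known entity in the query.

    The output format is an assumption of this sketch, not the paper's.
    """
    # Lowercase word set, so matching ignores case and punctuation.
    words = set(re.findall(r"[a-z]+", query.lower()))
    cues = [f"{entity}: {desc}" for entity, desc in evds.items() if entity in words]
    if not cues:
        return query  # no known entity: leave the raw query unchanged
    return f"{query} ({'; '.join(cues)})"


if __name__ == "__main__":
    print(enhance_query("a corgi on the beach", EXAMPLE_EVDS))
```

The sketch appends descriptions naively for every matched entity, which is the kind of expansion the abstract says may introduce noise. The trainable rewriter proposed to address that is not modeled here.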

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training