EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
GuangHao Meng, Sunan He, Jinpeng Wang, Tao Dai, Letian Zhang, Jieming Zhu, Qing Li, Gang Wang, Rui Zhang, Yong Jiang

TL;DR
EvdCLIP enhances vision-language retrieval by integrating entity visual descriptions generated by large language models into queries, and employs a trainable rewriter to improve retrieval accuracy.
Contribution
The paper introduces Entity Visual Descriptions (EVDs) generated by LLMs, together with a trainable EVD-aware rewriter that mitigates the noise query expansion can introduce, to improve retrieval performance.
Findings
EvdCLIP outperforms existing methods on benchmark datasets.
EVD integration significantly improves retrieval accuracy.
The trainable rewriter effectively reduces noise in queries.
Abstract
Vision-language retrieval (VLR), which uses text (or images) as queries to retrieve corresponding images (or text), has attracted significant attention in both academia and industry. However, existing methods often neglect the rich visual semantic knowledge of entities, leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a…
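As a rough illustration of the query-enrichment step described in the abstract, the sketch below appends a visual description to a raw query for each entity it recognizes. It is not the authors' implementation. The lookup table `EVD_TABLE`, the function `enhance_query`, the word-level entity matching, and the output format are all assumptions made for this example. In the paper the descriptions come from an LLM, and how they are merged into the query is not specified here.

```python
import re

# Hypothetical EVD lookup. In the paper these descriptions are generated
# by an LLM; the entries below are made up for illustration only.
EVD_TABLE = {
    "corgi": "short legs, pointed ears, tan and white coat",
    "frisbee": "flat round plastic disc",
}


def enhance_query(query, evd_table):
    """Append the visual description of each known entity found in the query.

    Entities are matched as lowercase words (an assumed, simplistic matcher).
    Queries without any known entity are returned unchanged.
    """
    words = re.findall(r"[a-z]+", query.lower())
    cues = []
    seen = set()
    for word in words:
        if word in evd_table and word not in seen:
            seen.add(word)
            cues.append(f"{word}: {evd_table[word]}")
    if not cues:
        return query
    return f"{query} ({'; '.join(cues)})"


if __name__ == "__main__":
    print(enhance_query("A corgi catches a frisbee", EVD_TABLE))
    print(enhance_query("A man rides a bike", EVD_TABLE))
```

The enhanced string would then be passed to the text encoder in place of the raw query. A query with no recognized entity passes through untouched, which keeps the expansion from adding text where no visual cue is available.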
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
