Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

Siting Li; Xiang Gao; Simon Shaolei Du

arXiv:2505.15877·cs.CV·October 15, 2025

TL;DR

This paper introduces promptable image embeddings for attribute-focused text-to-image retrieval. By highlighting the attributes a query asks about, they outperform general (global) image embeddings, and the paper offers scalable strategies for real-world deployment.

Contribution

The paper proposes a promptable embedding approach that improves attribute-specific retrieval, addressing the tendency of global embeddings to leave out the details that attribute-focused queries depend on.

Findings

01. Promptable embeddings improve Recall@5 by 15% when prompts are pre-defined.

02. A linear approximation of the promptable embeddings yields an 8% Recall@5 improvement when prompts are available only at inference time.

03. Current CLIP-like and MLLM-based retrievers struggle with attribute-focused queries.
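
For readers unfamiliar with the metric in the findings above, Recall@5 is commonly computed as the fraction of queries for which at least one relevant image appears among the top five results. The sketch below assumes that definition; the benchmark's exact evaluation protocol may differ, and the function name, argument names, and toy image ids are illustrative rather than taken from the paper.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[Sequence[int]],
                relevant_ids: Sequence[Set[int]],
                k: int = 5) -> float:
    """Fraction of queries with at least one relevant image in the top k.

    Assumed definition; the paper's evaluation protocol may differ.
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one relevant set per query")
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(image_id in relevant for image_id in ranking[:k])
    )
    return hits / len(ranked_ids)


# Toy example with made-up image ids (not data from the paper).
rankings = [[3, 7, 1, 9, 4, 2], [5, 6, 8, 0, 2, 1]]
relevant = [{4}, {1}]
score = recall_at_k(rankings, relevant, k=5)
```

In this toy case the first query finds its relevant image within the top five and the second does not, so `score` is 0.5.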

Abstract

While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with…
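
The retrieval setting the abstract describes, ranking images by how well their embeddings match a text query's embedding, can be sketched as follows. This is a minimal illustration assuming cosine similarity over precomputed vectors; `cosine`, `retrieve_top_k`, and the toy vectors are hypothetical, and no actual encoder (CLIP-like or MLLM-based) is invoked.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_emb: Sequence[float],
                   image_embs: Sequence[Sequence[float]],
                   k: int = 5) -> List[int]:
    """Return indices of the k images most similar to the query embedding."""
    scores = [cosine(query_emb, emb) for emb in image_embs]
    order = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy 3-d vectors standing in for encoder outputs (assumption, not real data).
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_emb = [0.9, 0.1, 0.0]
top = retrieve_top_k(query_emb, image_embs, k=2)
```

Under this framing, a promptable-embedding retriever would presumably swap `image_embs` for embeddings computed with an attribute-specific prompt while leaving the ranking step unchanged; that substitution is an assumption of this sketch, not a description of the authors' pipeline.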

Videos

Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval (SlidesLive)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Focus · Balanced Selection