Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

Jianglin Lu; Simon Jenni; Kushal Kafle; Jing Shi; Handong Zhao; Yun Fu

arXiv:2602.21175 · cs.CV · February 25, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a method to improve text-to-image retrieval by using language models to enrich short queries with detailed, quality-aware descriptions, enhancing retrieval accuracy and user control.

Contribution

It proposes a novel framework that leverages generative language models for query enrichment with explicit quality control, compatible with any pretrained vision-language model.
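As a rough illustration of the plug-and-play claim, the sketch below treats the language model and the vision-language encoder as interchangeable callables. The names `complete_query`, `encode_text`, and `image_embeddings` are hypothetical placeholders rather than the paper's API, and ranking by cosine similarity is an assumption about how a typical pretrained vision-language model would be used here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, complete_query, encode_text, image_embeddings, top_k=5):
    """Enrich a short query, embed it, and rank precomputed image embeddings.

    complete_query: hypothetical callable wrapping a language model (str -> str).
    encode_text: hypothetical callable wrapping any text encoder (str -> list of floats).
    image_embeddings: dict mapping image id -> embedding from the matching image encoder.
    """
    enriched = complete_query(query)
    query_vec = encode_text(enriched)
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return enriched, [image_id for image_id, _ in ranked[:top_k]]
```

Because the visual side only supplies precomputed embeddings, swapping in a different encoder pair would not change this function, which is the sense in which such a pipeline needs no modification to the visual model.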

Findings

01 · Significant improvement in retrieval accuracy.

02 · Effective control over image quality in retrieval results.

03 · Enhanced interpretability of enriched queries.

Abstract

Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality…
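One way to picture conditioning on discretized quality is sketched below: continuous scores are binned into a few levels, and the chosen levels are prepended to the short query before a language model completes it. The quantile binning, the tag format, and all function names are assumptions for illustration; the abstract does not specify them. The two dimensions used (relevance and aesthetics) are the ones named in the reviews.

```python
from bisect import bisect_right


def quantile_edges(scores, num_levels):
    """Return num_levels - 1 cut points that split the scores into roughly equal-sized bins."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(i * n) // num_levels] for i in range(1, num_levels)]


def quality_level(score, edges):
    """Map a continuous score to a discrete level in 0..len(edges)."""
    return bisect_right(edges, score)


def build_completion_prompt(query, relevance_level, aesthetic_level):
    """Prefix the short query with hypothetical quality tags for the language model to complete."""
    return f"<relevance_{relevance_level}> <aesthetic_{aesthetic_level}> {query}"
```

At query time a user would pick the levels directly, for example `build_completion_prompt("dog", 2, 1)`, and the resulting string would be handed to the completion model.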

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The motivation is clear. The paper introduces aesthetic cues to explicitly control and improve retrieval quality.
2. The method is plug-and-play and easy to apply in existing systems, requiring no modification to the visual model.

Weaknesses

1. Experiments mainly use single-word, single-object queries (e.g., “a dog”). Real retrieval queries are usually longer and involve multiple objects and relations. The current setup looks more like keyword/entity retrieval.
2. The paper doesn't report preprocessing cost or inference latency, so it's hard to judge the efficiency of the method.
3. The approach assumes the database has many visually similar images with different aesthetic qualities (like Flickr30k/COCO). In many datasets this won’t hold (Visua

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper proposes a new task: quality-controllable text-to-image retrieval, which considers both semantic relevance and aesthetic quality during retrieval. This setting aligns well with real-world search scenarios.
2. The method uses a large language model to generate query expansions with quality cues, requiring no modification to existing vision–language models and remaining relatively simple to implement.

Weaknesses

1. The paper mainly focuses on short queries but does not evaluate on datasets with long or descriptive queries. When the textual input already contains sufficient information, the effect of query completion may diminish or even introduce redundancy or semantic drift. Will this strategy still work for underspecified long user queries?
2. The paper's quantitative evaluation relies on a limited set of test queries, namely 80 concrete object nouns. Consequently, the method's performance in handling m

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper has a good motivation, addressing the challenge of underspecified queries that often lead to ambiguous retrieval results. Quality-controllable retrieval is both practically useful and of research interest.
- The paper is well-structured, and the methodology is described in sufficient detail.

Weaknesses

- While the concept of quality-controllable retrieval is attractive, the current work only explores two quality dimensions (relevance and aesthetics). This limited scope is insufficient to demonstrate the true practical value of controllable retrieval in applications.
- Despite the claim that LLM-based completions avoid irrelevant or hallucinated content, there is no explicit mechanism or evaluation to manage or detect query artifacts that could mislead retrieval or introduce out-of-distribution

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques