Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, and Evangelos Kanoulas

TL;DR
This paper introduces an interactive image retrieval system that leverages large language models and vision-language models to iteratively refine queries, significantly improving recall and addressing semantic gaps in traditional methods.
Contribution
The paper presents a novel multi-turn interactive image retrieval system integrating LLM-based query denoising and VLM-based captioning, along with a new dataset for evaluation.
Findings
Achieved a 10% improvement in recall over baseline methods.
Validated the effectiveness of query refinement through experiments.
Demonstrated state-of-the-art performance in image retrieval tasks.
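The recall figure above is easier to interpret with a concrete definition in hand. The summary does not say which recall variant or cutoff was used, so the following is only a minimal sketch of the common Recall@K definition (the fraction of relevant images that appear in the top K results). The function name `recall_at_k` and the example IDs are illustrative assumptions, not taken from the paper.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        # No relevant items: define recall as 0.0 to avoid division by zero.
        return 0.0
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)


# Hypothetical ranking of four images, two of which are relevant.
ranking = ["a", "b", "c", "d"]
relevant = ["b", "d"]
top2 = recall_at_k(ranking, relevant, 2)
top4 = recall_at_k(ranking, relevant, 4)
```

Under this definition, a retrieval system improves recall when a refined query moves relevant images that were previously ranked below the cutoff into the top K.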
Abstract
Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries and retrieving the most relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with…
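The multi-turn loop described in the abstract (retrieve, collect relevance feedback, caption the chosen image, rewrite the query, retrieve again) can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' implementation: the `CAPTIONS` dictionary stands in for the output of a VLM captioner, `rewrite_query` stands in for an LLM rewriter by simply appending unseen caption terms, and retrieval uses bag-of-words cosine similarity rather than a learned encoder. All names and data are hypothetical.

```python
import math
from collections import Counter

# Hypothetical database: image id -> caption a VLM captioner might produce.
CAPTIONS = {
    "img1": "a brown dog running on a beach",
    "img2": "a black cat sleeping on a sofa",
    "img3": "a brown dog sleeping on a sofa",
}


def embed(text):
    """Toy text representation: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=2):
    """Rank images by similarity between the query and each caption."""
    q = embed(query)
    ranked = sorted(CAPTIONS, key=lambda i: cosine(q, embed(CAPTIONS[i])), reverse=True)
    return ranked[:k]


def rewrite_query(query, feedback_caption):
    """Stand-in for an LLM rewriter: append caption terms the query lacks."""
    seen = set(query.lower().split())
    extra = []
    for term in feedback_caption.lower().split():
        if term not in seen:
            seen.add(term)
            extra.append(term)
    return " ".join([query] + extra)


def interactive_search(query, pick_relevant, turns=2, k=2):
    """Run several retrieval turns, refining the query from user feedback."""
    history = []
    for _ in range(turns):
        results = retrieve(query, k)
        history.append((query, results))
        chosen = pick_relevant(results)  # simulated user relevance feedback
        if chosen is None:
            break
        query = rewrite_query(query, CAPTIONS[chosen])
    return history


# Simulated user who is looking for img3 and marks it relevant when shown.
history = interactive_search("dog", lambda r: "img3" if "img3" in r else None)
```

In this toy run the one-word query "dog" cannot distinguish the two dog images, while the query enriched with the caption of the selected image ranks the intended image first in the second turn.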
