Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, and Evangelos Kanoulas

arXiv:2404.18746·cs.MM·April 30, 2024


TL;DR

This paper introduces an interactive image retrieval system that leverages large language models and vision-language models to iteratively refine queries, significantly improving recall and addressing semantic gaps in traditional methods.

Contribution

The paper presents a novel multi-turn interactive image retrieval system integrating LLM-based query denoising and VLM-based captioning, along with a new dataset for evaluation.

Findings

01. Achieved a 10% improvement in recall over baseline methods.

02. Validated the effectiveness of query refinement through experiments.

03. Demonstrated state-of-the-art performance in image retrieval tasks.

Abstract

Image search is a pivotal task in multimedia and computer vision, with applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with…
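
The multi-turn loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the hard-coded `CAPTIONS` stand in for VLM-generated captions, the stopword-and-duplicate filter in `denoise` stands in for the LLM-based query denoiser, bag-of-words cosine similarity stands in for the retrieval model, and every function name here is hypothetical.

```python
import math
from collections import Counter
from typing import Callable

# Toy corpus: image id -> caption. A VLM captioner would produce such text;
# here the captions are hard-coded (assumption).
CAPTIONS = {
    "img1": "a brown dog running on a beach",
    "img2": "a dog playing with a ball on sand near the sea",
    "img3": "a cat sleeping on a sofa",
    "img4": "a city street at night",
}

STOPWORDS = {"a", "the", "on", "with", "at", "near", "of"}


def denoise(query: str) -> str:
    """Hypothetical stand-in for an LLM denoiser: drop stopwords and repeated tokens."""
    seen, kept = set(), []
    for tok in query.lower().split():
        if tok not in STOPWORDS and tok not in seen:
            seen.add(tok)
            kept.append(tok)
    return " ".join(kept)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the ids of the k captions most similar to the query."""
    q = Counter(denoise(query).split())
    scored = [(cosine(q, Counter(denoise(c).split())), i) for i, c in CAPTIONS.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties broken by id
    return [i for _, i in scored[:k]]


def refine(query: str, relevant_ids: list[str]) -> str:
    """Expand the query with captions of images marked relevant, then denoise."""
    expanded = " ".join([query] + [CAPTIONS[i] for i in relevant_ids])
    return denoise(expanded)


def interactive_search(
    query: str, judge: Callable[[str], bool], turns: int = 2, k: int = 2
) -> tuple[str, list[str]]:
    """Run retrieval, collect relevance feedback via `judge`, refine, and repeat."""
    results = retrieve(query, k)
    for _ in range(turns):
        relevant = [i for i in results if judge(i)]
        if not relevant:
            break  # no feedback to learn from in this turn
        query = refine(query, relevant)
        results = retrieve(query, k)
    return query, results
```

Calling `interactive_search("dog beach", judge=lambda i: i in {"img1", "img2"})` simulates a user who marks the two dog images as relevant. The query grows with caption terms across turns, which is the vocabulary-mismatch remedy the abstract points to; a real system would swap each stand-in for a learned model.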
