VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering
Chun-Mei Feng, Yang Bai, Tao Luo, Zhen Li, Salman Khan, Wangmeng Zuo, Xinxing Xu, Rick Siow Mong Goh, Yong Liu

TL;DR
This paper introduces VQA4CIR, a post-processing method that enhances composed image retrieval by using visual question answering to identify and re-rank inconsistent retrieval results, improving accuracy on benchmark datasets.
Contribution
The work proposes a novel VQA-based post-processing approach for CIR that can be integrated with existing methods to reduce retrieval errors caused by caption-image inconsistency.
Findings
Outperforms state-of-the-art CIR methods on CIRR and Fashion-IQ datasets.
Effectively identifies inconsistent images using a QA generation and VQA verification pipeline.
Boosts CIR performance by re-ranking retrieved images based on VQA consistency.
Abstract
Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failed retrieval results are inconsistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C images retrieved by a CIR method, VQA4CIR aims to reduce the adverse effect of failed retrieval results that are inconsistent with the relative caption. To find the retrieved images inconsistent with the relative caption, we resort to a "QA generation to VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune an LVLM (e.g.,…
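The re-ranking idea can be illustrated with a minimal Python sketch. The abstract is truncated here, so the exact scoring rule is not given: the sketch assumes that an image counts as more inconsistent the more of its VQA answers disagree with the answers generated from the relative caption, and that less consistent images move toward the back of the top-C list. The names `rerank_by_vqa_consistency`, `answer_fn`, and the toy data are hypothetical and stand in for the fine-tuned LLM and LVLM, which are not reproduced.

```python
from typing import Callable, List, Sequence, Tuple

# (question, expected answer) pairs; in the paper these come from a
# fine-tuned LLM applied to the relative caption. Here they are hand-written.
QAPair = Tuple[str, str]


def rerank_by_vqa_consistency(
    retrieved: Sequence[str],
    qa_pairs: Sequence[QAPair],
    answer_fn: Callable[[str, str], str],
) -> List[str]:
    """Re-rank top-C image ids so that images whose VQA answers disagree
    with the caption-derived answers are pushed toward the end.

    Assumption: inconsistency is the count of mismatched answers.
    answer_fn(image_id, question) is a placeholder for a VQA model.
    """

    def mismatches(image_id: str) -> int:
        return sum(
            answer_fn(image_id, question).strip().lower()
            != expected.strip().lower()
            for question, expected in qa_pairs
        )

    # sorted() is stable, so images with equal mismatch counts keep the
    # order assigned by the underlying CIR method.
    return sorted(retrieved, key=mismatches)


# Toy stand-in for a VQA model: a lookup table of canned answers.
toy_answers = {
    ("img_a", "Is the dog sitting?"): "no",
    ("img_b", "Is the dog sitting?"): "yes",
    ("img_c", "Is the dog sitting?"): "Yes",
}


def toy_answer_fn(image_id: str, question: str) -> str:
    return toy_answers[(image_id, question)]


top_c = ["img_a", "img_b", "img_c"]  # order returned by some CIR method
qa = [("Is the dog sitting?", "yes")]
reranked = rerank_by_vqa_consistency(top_c, qa, toy_answer_fn)
print(reranked)
```

Sorting on the mismatch count with a stable sort keeps the original CIR ranking as the tie-breaker, so the post-processing step only changes positions where the VQA check provides evidence of inconsistency.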
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
