VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

Chun-Mei Feng; Yang Bai; Tao Luo; Zhen Li; Salman Khan; Wangmeng Zuo; Xinxing Xu; Rick Siow Mong Goh; Yong Liu

arXiv:2312.12273·cs.CV·December 20, 2023·2 cites

TL;DR

This paper introduces VQA4CIR, a post-processing method that enhances composed image retrieval by using visual question answering to identify and re-rank inconsistent retrieval results, improving accuracy on benchmark datasets.

Contribution

The work proposes a novel VQA-based post-processing approach for CIR that can be integrated with existing methods to reduce retrieval errors caused by caption-image inconsistency.

Findings

01. Outperforms state-of-the-art CIR methods on CIRR and Fashion-IQ datasets.
02. Effectively identifies inconsistent images using a QA generation and VQA verification pipeline.
03. Boosts CIR performance by re-ranking retrieved images based on VQA consistency.

Abstract

Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failed retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C images retrieved by a CIR method, VQA4CIR aims to reduce the adverse effect of failed retrieval results that are inconsistent with the relative caption. To find the retrieved images that are inconsistent with the relative caption, we resort to a "QA generation to VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune an LVLM (e.g.,…
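
The re-ranking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the exact-match answer comparison, and the rule of moving inconsistent images behind consistent ones are all assumptions made for illustration, and the fine-tuned LLM and LVLM are replaced by a hand-written lookup table. It only shows how QA pairs derived from a relative caption could be used to demote top-C candidates whose VQA answers disagree with the expected answers.

```python
from typing import Callable, List, Sequence, Tuple

# (question, expected answer) pair derived from the relative caption.
# In the paper these come from a fine-tuned LLM; here they are hand-written.
QAPair = Tuple[str, str]


def normalize(answer: str) -> str:
    """Lowercase and strip whitespace and trailing periods before comparing."""
    return answer.strip().lower().rstrip(".")


def rerank_by_vqa_consistency(
    candidates: Sequence[str],
    qa_pairs: Sequence[QAPair],
    vqa_answer: Callable[[str, str], str],
) -> List[str]:
    """Stable re-ranking (assumed rule): images whose VQA answers match every
    expected answer keep their relative order and come first; all other
    images keep their relative order and are moved behind them."""
    consistent: List[str] = []
    inconsistent: List[str] = []
    for image_id in candidates:
        ok = all(
            normalize(vqa_answer(image_id, question)) == normalize(expected)
            for question, expected in qa_pairs
        )
        (consistent if ok else inconsistent).append(image_id)
    return consistent + inconsistent


# Toy stand-in for a VQA model (hypothetical data, not from the paper).
TOY_ANSWERS = {
    ("img_a", "Is the dog sitting?"): "no",
    ("img_b", "Is the dog sitting?"): "Yes.",
    ("img_c", "Is the dog sitting?"): "yes",
}


def toy_vqa(image_id: str, question: str) -> str:
    return TOY_ANSWERS[(image_id, question)]


top_c = ["img_a", "img_b", "img_c"]          # ranking from some CIR method
qa_pairs = [("Is the dog sitting?", "yes")]  # from the relative caption
reranked = rerank_by_vqa_consistency(top_c, qa_pairs, toy_vqa)
print(reranked)  # ['img_b', 'img_c', 'img_a']
```

A stable partition is used here so that the original CIR ranking is preserved within each group; other rank-modification rules (for example, a penalty per mismatched answer) would fit the same interface.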

Code & Models

Repositories

chunmeifeng/vqa4cir · Official

Videos

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning