Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs
Jialiang Xu, Michael Moor, Jure Leskovec

TL;DR
This paper introduces Reverse Image Retrieval (RIR) augmentation, which supplies multimodal large language models with web-scale reverse image search results and improves their knowledge-intensive visual question answering by 18-43% across GPT-4V, GPT-4 Turbo, and GPT-4o.
Contribution
The paper demonstrates that RIR augmentation improves knowledge-intensive VQA performance and, unexpectedly, helps models access their own world knowledge: the retrieved results act as additional visual and textual cues and need not contain the direct answer to the query.
Findings
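RIR improves GPT-4V on knowledge-intensive VQA by 37-43% in open-ended evaluation metrics
RIR improves GPT-4 Turbo by 25-27% and GPT-4o by 18-20% on the same metrics
RIR can hurt performance in some cases, which the paper analyzes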
Abstract
Despite impressive advances in recent multimodal large language models (MLLMs), state-of-the-art models such as those from the GPT-4 suite still struggle with knowledge-intensive tasks. To address this, we consider Reverse Image Retrieval (RIR) augmented generation, a simple yet effective strategy to augment MLLMs with web-scale reverse image search results. RIR robustly improves knowledge-intensive visual question answering (VQA) of GPT-4V by 37-43%, GPT-4 Turbo by 25-27%, and GPT-4o by 18-20% in terms of open-ended VQA evaluation metrics. To our surprise, we discover that RIR helps the model to better access its own world knowledge. Concretely, our experiments suggest that RIR augmentation helps by providing further visual and textual cues without necessarily containing the direct answer to a query. In addition, we elucidate cases in which RIR can hurt performance and conduct a human…
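The general shape of retrieval-augmented VQA described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the SearchHit schema, the build_rir_prompt and answer_with_rir helpers, and the reverse_search and mllm callables are all hypothetical names introduced here, and the sketch passes only textual cues to the model, whereas the paper's pipeline may also pass retrieved visual content.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class SearchHit:
    """One reverse-image-search result (hypothetical schema, not from the paper)."""
    title: str
    snippet: str


def build_rir_prompt(question: str, hits: Sequence[SearchHit], max_hits: int = 3) -> str:
    """Prepend textual cues from the top search hits to the user's question."""
    lines = ["Reverse image search results for the query image:"]
    for i, hit in enumerate(hits[:max_hits], start=1):
        lines.append(f"{i}. {hit.title}: {hit.snippet}")
    if len(lines) == 1:
        # No hits were returned; say so explicitly instead of leaving a gap.
        lines.append("(no results)")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def answer_with_rir(
    image: Any,
    question: str,
    reverse_search: Callable[[Any], Sequence[SearchHit]],
    mllm: Callable[[Any, str], Any],
    max_hits: int = 3,
) -> Any:
    """Retrieve cues for the image, then query the model with image plus cues.

    Both callables are assumptions supplied by the caller: `reverse_search`
    wraps whatever search backend is available, and `mllm` wraps the
    multimodal model being augmented.
    """
    hits = reverse_search(image)
    prompt = build_rir_prompt(question, hits, max_hits)
    return mllm(image, prompt)
```

Keeping the search backend and the model behind plain callables lets the same wrapper be run with and without retrieval, which is the comparison the reported improvements rest on.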
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
