Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs

Jialiang Xu; Michael Moor; Jure Leskovec

arXiv:2405.18740 · cs.CL · May 30, 2024

TL;DR

This paper introduces Reverse Image Retrieval (RIR) augmentation for multimodal large language models, significantly enhancing their knowledge-intensive visual question answering capabilities by leveraging web-scale reverse image search results.

Contribution

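The paper demonstrates that RIR augmentation improves knowledge-intensive VQA performance across GPT-4V, GPT-4 Turbo, and GPT-4o. It also finds that RIR helps models access their own world knowledge: the retrieved results act as visual and textual cues and need not contain the direct answer to a query.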

Findings

01. RIR improves GPT-4V VQA accuracy by 37-43%

02. RIR enhances GPT-4 Turbo VQA by 25-27%

03. RIR can sometimes negatively impact performance

Abstract

Despite impressive advances in recent multimodal large language models (MLLMs), state-of-the-art models such as those from the GPT-4 suite still struggle with knowledge-intensive tasks. To address this, we consider Reverse Image Retrieval (RIR) augmented generation, a simple yet effective strategy to augment MLLMs with web-scale reverse image search results. RIR robustly improves knowledge-intensive visual question answering (VQA) of GPT-4V by 37-43%, GPT-4 Turbo by 25-27%, and GPT-4o by 18-20% in terms of open-ended VQA evaluation metrics. To our surprise, we discover that RIR helps the model better access its own world knowledge. Concretely, our experiments suggest that RIR augmentation helps by providing further visual and textual cues without necessarily containing the direct answer to a query. In addition, we elucidate cases in which RIR can hurt performance and conduct a human…
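To make the idea of retrieval-augmented prompting concrete, the following is a minimal sketch of how reverse image search hits might be folded into a VQA prompt. It is not the authors' implementation: the `RetrievedResult` structure, the `build_rir_prompt` function, the prompt wording, and the `max_results` cutoff are all hypothetical, and the search step itself is assumed to have already happened elsewhere. The actual pipeline in the paper may pass retrieved content to the model in a different form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievedResult:
    """One hit from a reverse image search (hypothetical structure)."""
    title: str
    snippet: str


def build_rir_prompt(
    question: str,
    results: List[RetrievedResult],
    max_results: int = 3,
) -> str:
    """Prepend retrieved cues to a question about an image.

    The cues are offered as hints only; they are not assumed to
    contain the answer, matching the abstract's description.
    """
    # Keep only the top-ranked hits and number them for readability.
    lines = [
        f"[{i}] {r.title}: {r.snippet}"
        for i, r in enumerate(results[:max_results], start=1)
    ]
    context = "\n".join(lines) if lines else "(no results)"
    return (
        "Reverse image search results (may be noisy or irrelevant):\n"
        f"{context}\n\n"
        f"Question about the image: {question}\n"
        "Answer using the image, the cues above, and your own knowledge."
    )


if __name__ == "__main__":
    # Placeholder hits for illustration only; not real search output.
    hits = [
        RetrievedResult("Example page title", "Example caption text"),
        RetrievedResult("Another page title", "Another caption text"),
    ]
    print(build_rir_prompt("What is shown in this image?", hits))
```

The resulting string would be sent to the multimodal model together with the original image. Capping the number of hits is one simple way to limit noise from irrelevant results, which the abstract notes can hurt performance in some cases.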

Code & Models

Repositories

mi92/reverse-image-rag · Official

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections