Quantifying Memorization and Parametric Response Rates in Retrieval-Augmented Vision-Language Models

Peter Carragher; Abhinand Jha; R Raghav; Kathleen M. Carley

arXiv:2502.13836 · cs.LG · June 17, 2025

Open Access

TL;DR

This paper investigates how much retrieval-augmented vision-language models rely on memorized training data rather than retrieved information, proposes proxy metrics to quantify memorization, and compares parametric response rates for text-based and image-based questions.

Contribution

It introduces proxy metrics for memorization in multimodal models and provides the first empirical comparison of parametric effects across text and image modalities.

Findings

01 Finetuned models rely more on memorization than retrieval-augmented models.

02 Finetuned models achieve higher accuracy than retrieval-augmented models on WebQA (72% vs 52%).

03 Image-based questions have parametric response rates 15-25% higher than text-based questions.

Abstract

Large Language Models (LLMs) demonstrate remarkable capabilities in question answering (QA), but metrics for assessing their reliance on memorization versus retrieval remain underdeveloped. Moreover, while finetuned models are state-of-the-art on closed-domain tasks, general-purpose models like GPT-4o exhibit strong zero-shot performance. This raises questions about the trade-offs between memorization, generalization, and retrieval. In this work, we analyze the extent to which multimodal retrieval-augmented VLMs memorize training data compared to baseline VLMs. Using the WebQA benchmark, we contrast finetuned models with baseline VLMs on multihop retrieval and question answering, examining the impact of finetuning on data memorization. To quantify memorization in end-to-end retrieval and QA systems, we propose several proxy metrics by investigating instances where QA succeeds despite…
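The abstract is cut off before it defines the proxy metrics, so the following is only a minimal sketch of one metric of this general kind, not the paper's definition. It assumes each evaluated question is logged with two flags (whether retrieval found the gold sources, and whether the final answer was correct) and reports how often QA still succeeds when retrieval has failed. The record type `QARecord`, its field names, and the function `success_rate_without_retrieval` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QARecord:
    # Hypothetical per-question log entry; field names are illustrative only.
    retrieval_correct: bool  # did the retriever return the gold sources?
    answer_correct: bool     # was the final answer judged correct?


def success_rate_without_retrieval(records: Sequence[QARecord]) -> Optional[float]:
    """Fraction of retrieval failures on which QA still succeeded.

    Assumed proxy for answers coming from model parameters rather than
    retrieved sources. Returns None if there are no retrieval failures.
    """
    failures = [r for r in records if not r.retrieval_correct]
    if not failures:
        return None
    return sum(r.answer_correct for r in failures) / len(failures)


# Toy data, invented for this example (not results from the paper).
records = [
    QARecord(retrieval_correct=True, answer_correct=True),
    QARecord(retrieval_correct=False, answer_correct=True),
    QARecord(retrieval_correct=False, answer_correct=False),
    QARecord(retrieval_correct=True, answer_correct=False),
    QARecord(retrieval_correct=False, answer_correct=True),
]
rate = success_rate_without_retrieval(records)
print(f"QA success rate when retrieval fails: {rate:.3f}")
```

Conditioning on retrieval failures, rather than dividing by all questions, keeps the number comparable between systems whose retrievers differ in quality. Whether the paper normalizes its metrics this way is not stated in the visible text.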

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications