Quantifying Memorization and Parametric Response Rates in Retrieval-Augmented Vision-Language Models
Peter Carragher, Abhinand Jha, R Raghav, Kathleen M. Carley

TL;DR
This paper investigates how far retrieval-augmented vision-language models rely on memorized training data rather than retrieved information, proposes proxy metrics to quantify memorization, and compares parametric response rates between text and visual modalities.
Contribution
It introduces proxy metrics for memorization in multimodal models and provides the first empirical comparison of parametric effects across text and image modalities.
Findings
Finetuned models rely more on memorization than retrieval-augmented models.
Finetuned models achieve higher accuracy than retrieval-augmented models on WebQA (72% vs 52%).
Image-based questions have 15-25% higher parametric response rates than text-based questions.
Abstract
Large Language Models (LLMs) demonstrate remarkable capabilities in question answering (QA), but metrics for assessing their reliance on memorization versus retrieval remain underdeveloped. Moreover, while finetuned models are state-of-the-art on closed-domain tasks, general-purpose models like GPT-4o exhibit strong zero-shot performance. This raises questions about the trade-offs between memorization, generalization, and retrieval. In this work, we analyze the extent to which multimodal retrieval-augmented VLMs memorize training data compared to baseline VLMs. Using the WebQA benchmark, we contrast finetuned models with baseline VLMs on multihop retrieval and question answering, examining the impact of finetuning on data memorization. To quantify memorization in end-to-end retrieval and QA systems, we propose several proxy metrics by investigating instances where QA succeeds despite…
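The abstract is cut off before it defines the proxy metrics, so the following is only an illustrative sketch, not the authors' definition. It assumes one plausible reading: memorization can be approximated by how often the final answer is correct even though the retrieval step did not return the right sources. The record fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class QARecord:
    # Hypothetical fields, not taken from the paper.
    retrieval_correct: bool  # did the retriever return the gold sources?
    answer_correct: bool     # was the final answer judged correct?


def unsupported_success_rate(records: Iterable[QARecord]) -> float:
    """Share of retrieval failures that still ended in a correct answer.

    A high value suggests answers come from model parameters rather than
    from the retrieved context (an assumed proxy, see the lead-in).
    """
    failed = [r for r in records if not r.retrieval_correct]
    if not failed:
        return 0.0  # no retrieval failures, so there is nothing to measure
    return sum(r.answer_correct for r in failed) / len(failed)


# Toy data: three retrieval failures, two of them answered correctly anyway.
records = [
    QARecord(retrieval_correct=True, answer_correct=True),
    QARecord(retrieval_correct=False, answer_correct=True),
    QARecord(retrieval_correct=False, answer_correct=False),
    QARecord(retrieval_correct=False, answer_correct=True),
    QARecord(retrieval_correct=True, answer_correct=False),
]
rate = unsupported_success_rate(records)
print(f"unsupported success rate: {rate:.2f}")
```

Computing the same rate separately for a finetuned model and a retrieval-augmented one, or for text-based and image-based questions, would give the kind of side-by-side comparison the findings describe.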
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
