Quantifying Cross-Modality Memorization in Vision-Language Models

Yuxin Wen; Yangsibo Huang; Tom Goldstein; Ravi Kumar; Badih Ghazi; Chiyuan Zhang

arXiv:2506.05198 · cs.CV · June 6, 2025

TL;DR

This paper systematically investigates cross-modality memorization in vision-language models. It shows that facts learned in one modality transfer to the other but are recalled less reliably there than in the modality they were learned in, which has implications for making multimodal learning more robust.

Contribution

It introduces a synthetic persona dataset for controlled experiments, analyzes how factual knowledge transfers across modalities in vision-language models, identifies gaps in that transfer, and proposes methods to mitigate them.

Findings

01

Facts learned in one modality transfer to the other.

02

A significant gap exists between recall in the source modality and recall in the target modality.

03

The transferability gap persists across various scenarios.
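The first two findings can be made concrete with a small calculation: score a model's answers to the same factual questions when they are posed in the modality the fact was learned in (the source) and in the other modality (the target), then take the difference. The Python sketch below is illustrative only. It assumes a hypothetical PersonaQuery record, exact-match scoring after case and whitespace normalization, and toy answers, none of which come from the paper, whose evaluation protocol and metric may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PersonaQuery:
    """One factual question about a synthetic persona (hypothetical schema)."""
    gold: str         # ground-truth answer, e.g. a birthplace
    source_pred: str  # model answer when queried in the modality the fact was learned in
    target_pred: str  # model answer when queried in the other modality


def normalize(text: str) -> str:
    # Assumed matching rule: ignore case and collapse runs of whitespace.
    return " ".join(text.lower().split())


def recall(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of gold answers matched exactly (after normalization) by the predictions."""
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)


def transfer_gap(queries: List[PersonaQuery]) -> Tuple[float, float, float]:
    """Return (source recall, target recall, source recall minus target recall)."""
    golds = [q.gold for q in queries]
    src = recall([q.source_pred for q in queries], golds)
    tgt = recall([q.target_pred for q in queries], golds)
    return src, tgt, src - tgt


if __name__ == "__main__":
    # Toy answers for illustration only; not data from the paper.
    queries = [
        PersonaQuery("Lisbon", "Lisbon", "Lisbon"),
        PersonaQuery("chemist", "Chemist", "teacher"),
        PersonaQuery("1987", "1987", "1990"),
        PersonaQuery("Oslo", "Oslo ", "oslo"),
    ]
    print(transfer_gap(queries))  # (1.0, 0.5, 0.5)
```

In this toy example the model recalls every fact in the source modality but only half of them in the target modality, giving a gap of 0.5. A real evaluation would substitute the paper's own questions, answer matching, and metric.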

Abstract

Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Text Readability and Simplification

Methods: Diffusion · Focus