Captured by Captions: On Memorization and its Mitigation in CLIP Models
Wenhao Wang, Adam Dziedzic, Grace C. Kim, Michael Backes, Franziska Boenisch

TL;DR
This paper investigates how CLIP models memorize training data, introduces a formal definition of memorization in CLIP, and proposes strategies to mitigate memorization without sacrificing utility.
Contribution
It provides the first formal framework for understanding memorization in CLIP and demonstrates effective mitigation strategies that preserve model utility.
Findings
CLIP's memorization behavior falls between the supervised and self-supervised paradigms.
Mis-captioned samples show the highest memorization levels.
The text encoder contributes more to memorization than the image encoder.
Abstract
Multi-modal models, such as CLIP, have demonstrated strong performance in aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. Despite this success, the mechanisms by which these models utilize training data, particularly the role of memorization, remain unclear. In uni-modal models, both supervised and self-supervised, memorization has been shown to be essential for generalization. However, it is not well understood how these findings would apply to CLIP, which incorporates elements from both supervised learning via captions that provide a supervisory signal similar to labels, and from self-supervised learning via the contrastive objective. To bridge this gap in understanding, we propose a formal definition of memorization in CLIP (CLIPMem) and use it to quantify memorization in CLIP models. Our results indicate that CLIP's…
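To make the idea of quantifying memorization concrete, here is a minimal sketch of a leave-one-out style score for a single image-caption pair. It is an illustration under assumptions, not the paper's exact CLIPMem definition: it assumes two models, one trained with the pair and one trained without it, and takes the score to be the gap in image-text cosine alignment between them. The function names `cosine` and `loo_alignment_gap` and the toy embeddings are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def loo_alignment_gap(img_with, txt_with, img_without, txt_without):
    """Hypothetical leave-one-out score for one image-caption pair.

    img_with / txt_with: embeddings from a model trained WITH the pair.
    img_without / txt_without: embeddings from a model trained WITHOUT it.
    A large positive value means the pair is aligned much better only
    when the model has seen it, which is the intuition behind
    leave-one-out memorization (assumed form, not the paper's formula).
    """
    return cosine(img_with, txt_with) - cosine(img_without, txt_without)


# Toy embeddings (made up for illustration only).
seen_img, seen_txt = [1.0, 0.0], [1.0, 0.0]      # perfectly aligned
unseen_img, unseen_txt = [1.0, 0.0], [0.0, 1.0]  # orthogonal
score = loo_alignment_gap(seen_img, seen_txt, unseen_img, unseen_txt)
print(f"leave-one-out alignment gap: {score:.2f}")
```

In practice such a score would be averaged over several independently trained model pairs to reduce training noise. That averaging is also an assumption here rather than something stated in the text above.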
Peer Reviews
Decision·ICLR 2025 Poster
**Strong points:** - Well-highlighted literature on memorization. - Defines CLIPMem based on a hold-one-out strategy (similar to Feldman et al. in supervised learning). - Interesting results on mis-captioned text labels, multi-caption and removal of memorized examples. - Reasonable pretraining datasets like CC3M. - Clean separation of training and test splits for measuring memorization.
**Weak points:** - CLIPMem cannot be applied to general off-the-shelf CLIP models; if I understand correctly, it currently requires retraining on specific splits. - The specifics of how CLIPMem is used for vision only versus joint vision + text could be made clearer. - The noising results (Table 5-b) are not very convincing. Almost all the results are within the same +/- std range. - The linear probe accuracy seems quite low (Table 1, 5-a, 5-b, 6-a/b). **Nit:** - Text and images
+ It introduces a new metric -- CLIPMem to provide a new way for measuring memorization in multi-modal settings, a gap in previous research. + It performs empirical analysis to show differences in memorization between the text and image modalities, providing actionable insights. + It proposes techniques to successfully reduce memorization while preserving or even enhancing model utility, challenging established norms. + By highlighting the risks of training with uncurated, potentially mis-captio
- While tailored to CLIP, the metric and findings may need adaptation to apply effectively to other multi-modal models with different architectures. - The experiments focus on datasets like COCO and CC3M, so it’s unclear how well these findings generalize to other large-scale or domain-specific datasets. - The mitigation strategies, such as augmenting captions or generating variations, may incur additional computational costs in training, which could limit practicality for some users.
1. They propose a new metric to measure memorization in the CLIP model. The design of the metric is reasonable. 2. Their insight that memorization is more significant in the text encoder is new and might interest readers. 3. They conduct an analysis of text and image augmentation, which might also be interesting to some readers. 4. This paper is well-organized and easy to follow.
1. They do not discuss how model size can affect memorization. Although I am not very familiar with this topic, I guess the model size can affect their arguments. For example, if they utilize a larger image encoder, the memorization might be more significant on the image side. Therefore, I think their conclusion about which encoder suffers more from memorization could change with the size of the encoders, but they do not discuss this much. 2. Most of their findings sound a bit too reasonable and are no
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Contrastive Language-Image Pre-training · Focus
