Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

Bumjun Kim; Albert No

arXiv:2605.02908·cs.CV·May 6, 2026

TL;DR

In Stable Diffusion, memorization is driven less by the prompt embeddings than by the <pad> embeddings, which structurally duplicate the <endoftext> embedding. The paper also proposes simple mitigation strategies.

Contribution

The paper identifies which CLIP embeddings drive memorization in Stable Diffusion and introduces inference-time mitigation techniques.

Findings

01 CLIP prompt embeddings minimally impact memorization

02 Structural duplication of <pad> embeddings amplifies memorization

03 Proposed mitigation strategies effectively reduce memorization

Abstract

Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <startoftext>, <prompt>, <endoftext>, and <pad>, with corresponding embeddings $v^{sot}$, $v^{pr}$, $v^{eot}$, $v^{pad}$. We discover that $v^{pr}$ contribute minimally to generation in memorized cases. In contrast, $v^{pad}$ strongly affect memorization due to their structural duplication of $v^{eot}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of…
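The token categorization described in the abstract can be illustrated with a small sketch. It labels each position of a fixed-length text-encoder input as <startoftext>, <prompt>, <endoftext>, or <pad>. The function name `categorize_positions`, the label strings, and the default context length of 77 (the usual CLIP text-encoder length in Stable Diffusion) are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List

# Hypothetical label names for the four token categories in the abstract.
SOT, PROMPT, EOT, PAD = "sot", "prompt", "eot", "pad"


def categorize_positions(num_prompt_tokens: int, context_length: int = 77) -> List[str]:
    """Label every position of a fixed-length token sequence.

    Assumed layout: one <startoftext>, then the prompt tokens, then one
    <endoftext>, then <pad> up to `context_length`. Prompts that do not
    fit are truncated so that <endoftext> is still present.
    """
    if num_prompt_tokens < 0:
        raise ValueError("num_prompt_tokens must be non-negative")
    if context_length < 2:
        raise ValueError("context_length must leave room for <sot> and <eot>")

    # Reserve two positions for <startoftext> and <endoftext>.
    kept = min(num_prompt_tokens, context_length - 2)
    labels = [SOT] + [PROMPT] * kept + [EOT]
    # Everything after <endoftext> is padding.
    labels += [PAD] * (context_length - len(labels))
    return labels


if __name__ == "__main__":
    labels = categorize_positions(num_prompt_tokens=5)
    for name in (SOT, PROMPT, EOT, PAD):
        print(name, labels.count(name))
```

For a short prompt most positions fall into the <pad> category, which gives a rough sense of why duplicated <pad> embeddings could carry a large share of the conditioning signal.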
