No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

Joonsung Jeon; Woo Jae Kim; Suhyeon Ha; Sooel Son; Sung-Eui Yoon

arXiv:2602.22689·cs.CV·February 27, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces MoFit, a novel caption-free membership inference attack that constructs synthetic conditioning inputs to determine if a sample was part of a diffusion model's training data, addressing scenarios without ground-truth captions.

Contribution

MoFit is the first caption-free MIA framework that overcomes the reliance on ground-truth captions by synthesizing overfitted embeddings, improving privacy auditing of diffusion models.

Findings

1. MoFit outperforms prior VLM-conditioned baselines.
2. MoFit achieves performance comparable to caption-dependent methods.
3. MoFit is effective across multiple datasets and models.

Abstract

Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

By tackling the MIA problem in the absence of ground-truth captions, the work fills a clear gap in current literature, reflecting realistic adversarial constraints faced in practice.

Weaknesses

1. Limited diversity in experiments with large models: While SD v1.5 is included, the exploration of truly large-scale public models (e.g., with more challenging real-world splits or harder negative scenarios) is limited. The LAION-mi split required special curation and so does not test MoFit "in the wild" under extreme generalization.
2. The paper focuses heavily on the positive results of MoFit, but does not sufficiently examine its limitations, potential false-positive drivers (e.g., for ou

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Fills the "caption-free" gap, fits real scenarios where training annotations are undisclosed, and has practical value.
2. The two-stage optimization is logically consistent, building on the observation that member samples are more sensitive to mismatched conditions.
3. Outperforms VLM-based baselines across datasets and models, and surpasses caption-dependent methods in some cases.

Weaknesses

1. The two-stage optimization takes 7-9 minutes per image on an RTX 4090, far slower than VLM-based baselines, which makes it impractical for large-scale data.
2. No tests on non-ideal images or small-sample LDMs; key parameters are validated only on Pokemon, so cross-scenario adaptability is unverified.
3. It excludes recent caption-free MIA studies, uses only 2 VLMs as baselines, and lacks a theoretical explanation for member samples' sensitivity.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The problem being studied is practical. Privacy risks in generative diffusion models are an important topic. The work addresses a realistic limitation in existing MIAs by removing the dependency on ground-truth captions, making it applicable to real-world scenarios.
- Strong empirical performance: The results show consistent improvements over VLM-substituted baselines across multiple datasets and models.
- Comprehensive analysis: The paper provides detailed analysis and discussion of differe

Weaknesses

- Despite arguing that the proposed method has a more practical setting, it still relies on internal model signals (e.g., likelihoods or denoising trajectories required by CLiD) that may not translate to black-box APIs. While common in MIA research, this reduces practicality in real-world applications.
- The experiments focus on fine-tuned models on relatively small datasets (e.g., Pokémon with ~800 images), so generalization to large-scale, pre-trained models is unclear. It is important to co

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning