Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models
Zeyuan Chen, Yihan Ma, Xinyue Shen, Michael Backes, Yang Zhang

TL;DR
This paper introduces PopQuiz, a black-box membership inference attack on large language models that turns target data into quiz-style multiple-choice questions and infers from the model's answers whether that data was part of its training set.
Contribution
The paper presents PopQuiz, a black-box attack that outperforms existing membership inference techniques at determining whether specific data was used to train a large language model.
Findings
Achieves an average ROC-AUC of 0.873 across six LLMs and four datasets.
Outperforms existing approaches by 20.6%.
Instruction-based, filter-based, and differential privacy-based defenses reduce attack success but do not eliminate privacy risks.
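For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member. The sketch below computes it directly from that definition. The function name `roc_auc` and the scores are made up for illustration and are not taken from the paper.

```python
def roc_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win, which matches the usual
    Mann-Whitney formulation of ROC-AUC.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores, for illustration only (not results from the paper).
members = [0.9, 0.8, 0.6, 0.7]
nonmembers = [0.5, 0.65, 0.3, 0.2]
auc = roc_auc(members, nonmembers)
print(f"ROC-AUC = {auc:.4f}")
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation of members from non-members.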
Abstract
Large language models (LLMs) show strong performance across many applications, but their ability to memorize and potentially reveal training data raises serious privacy concerns. We introduce the PopQuiz Attack, a black-box membership inference attack that tests whether a model can recall specific training examples. The core idea is to turn target data into quiz-style multiple-choice questions and infer membership from the model's answers. Across six widely used LLMs (GPT-3.5, GPT-4o, LLaMA2-7b, LLaMA2-13b, Mistral-7b, and Vicuna-7b) and four datasets, our method achieves an average ROC-AUC of 0.873 and outperforms existing approaches by 20.6%. We further analyze factors affecting attack success, including query complexity, data type, data structure, and training settings. We also evaluate instruction-based, filter-based, and differential privacy-based defenses, which reduce performance…
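The quiz idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how questions or distractors are built, so the word-masking scheme, the distractor pool, the scoring rule, and every name here (`build_quizzes`, `membership_score`, `memorizing_stub`) are assumptions. The stub stands in for a black-box model API.

```python
import random


def build_quizzes(text, distractor_pool, n_questions, rng):
    """Assumed scheme: mask one word per question and offer four options."""
    words = text.split()
    positions = rng.sample(range(len(words)), k=min(n_questions, len(words)))
    quizzes = []
    for pos in positions:
        answer = words[pos]
        stem = " ".join(words[:pos] + ["____"] + words[pos + 1:])
        candidates = [w for w in distractor_pool if w != answer]
        options = rng.sample(candidates, k=3) + [answer]
        rng.shuffle(options)
        key = "ABCD"[options.index(answer)]
        lines = [stem] + [f"{label}. {opt}" for label, opt in zip("ABCD", options)]
        quizzes.append(("\n".join(lines), key))
    return quizzes


def membership_score(text, distractor_pool, ask_model, n_questions=5, seed=0):
    """Fraction of quizzes answered correctly; higher suggests membership."""
    rng = random.Random(seed)
    quizzes = build_quizzes(text, distractor_pool, n_questions, rng)
    correct = sum(
        ask_model(prompt).strip().upper().startswith(key)
        for prompt, key in quizzes
    )
    return correct / len(quizzes)


# Hypothetical target text and distractor words, for illustration only.
TARGET = "the quick brown fox jumps over the lazy dog"
POOL = ["apple", "river", "stone", "cloud", "piano"]


def memorizing_stub(prompt):
    """Stand-in for a model that has memorized TARGET (not a real LLM call)."""
    known = set(TARGET.split())
    for line in prompt.splitlines()[1:]:
        label, option = line.split(". ", 1)
        if option in known:
            return label
    return "A"


score = membership_score(TARGET, POOL, memorizing_stub)
print(f"membership score = {score:.2f}")
```

In a real black-box setting, `ask_model` would wrap a call to the target model, and the resulting score would be compared against a threshold chosen on data with known membership.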
