TL;DR
This paper introduces methods for constructing unbiased datasets to evaluate Membership Inference Attacks (MIAs) on Large Language Models (LLMs). It shows that removing bias reduces attack effectiveness, which highlights the importance of dataset fairness.
Contribution
It proposes algorithms for constructing non-biased datasets for fairer MIA evaluation on LLMs, addressing distributional biases in ex-post assessments.
Findings
Bias removal diminishes MIA effectiveness
Non-biased datasets yield AUC-ROC scores similar to random datasets
Most MIAs perform near random when biases are neutralized
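To make the AUC-ROC findings above concrete, here is a minimal sketch of how the metric can be computed for an attack that assigns each document a membership score. It is not the paper's evaluation code; the function name `auc_roc` and the example scores are hypothetical. A value near 0.5 means the attack separates members from non-members no better than chance.

```python
def auc_roc(member_scores, non_member_scores):
    """AUC-ROC as the probability that a randomly chosen member
    receives a higher attack score than a randomly chosen non-member.
    Ties count as half a win (Mann-Whitney formulation)."""
    if not member_scores or not non_member_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical attack scores, for illustration only.
members = [0.9, 0.4, 0.6, 0.2]
non_members = [0.8, 0.3, 0.5, 0.1]
print(auc_roc(members, non_members))  # 0.625
```

The pairwise loop is quadratic in the number of documents, which is fine for illustration; a rank-based implementation would be preferable for large evaluation sets.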
Abstract
The rise of Large Language Models (LLMs) has triggered legal and ethical concerns, especially regarding the unauthorized use of copyrighted materials in their training datasets. This has led to lawsuits against tech companies accused of using protected content without permission. Membership Inference Attacks (MIAs) aim to detect whether specific documents were used in a given LLM's pretraining, but their effectiveness is undermined by biases such as time shifts and n-gram overlaps. This paper addresses the evaluation of MIAs on LLMs with partially inferable training sets, under the ex-post hypothesis, which acknowledges inherent distributional biases between member and non-member datasets. We propose and validate algorithms to create "non-biased" and "non-classifiable" datasets for fairer MIA assessment. Experiments using the Gutenberg dataset on OpenLLaMA and Pythia show that…
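One of the biases named in the abstract, n-gram overlap, can be illustrated with a short sketch. This is not the paper's algorithm: the helper names `ngrams` and `ngram_overlap`, the whitespace tokenization, and the default `n=3` are all assumptions chosen for illustration. The idea is that if candidate non-members share far fewer n-grams with known members than members share among themselves, an attack may be detecting that distribution gap rather than memorization.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(doc, reference_docs, n=3):
    """Fraction of the document's distinct n-grams that also occur
    in at least one reference document. Uses naive whitespace
    tokenization (an assumption, not the paper's preprocessing)."""
    doc_grams = ngrams(doc.split(), n)
    if not doc_grams:
        return 0.0  # document shorter than n tokens
    ref_grams = set()
    for ref in reference_docs:
        ref_grams |= ngrams(ref.split(), n)
    return len(doc_grams & ref_grams) / len(doc_grams)


# Hypothetical toy documents.
print(ngram_overlap("a b c d", ["a b c x"], n=3))  # 0.5
```

Comparing the distribution of this score across the member and non-member sets gives a rough check for whether the two sets differ in ways unrelated to training membership.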
