Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
Ali Naseh, Niloofar Mireshghallah

TL;DR
This paper shows that membership inference attacks (MIAs) on large language models are unreliable when evaluated with synthetic data: the attacks tend to classify synthetic text as training data, which misleads conclusions about model memorization.
Contribution
It demonstrates that MIAs act as machine-generated text detectors, highlighting a fundamental flaw in using synthetic data for membership inference evaluations.
Findings
MIAs often misclassify synthetic data as training data.
Synthetic data can lead to false conclusions about model memorization.
The issue persists across various model architectures and sizes.
Abstract
Recent work shows membership inference attacks (MIAs) on large language models (LLMs) produce inconclusive results, partly due to difficulties in creating non-member datasets without temporal shifts. While researchers have turned to synthetic data as an alternative, we show this approach can be fundamentally misleading. Our experiments indicate that MIAs function as machine-generated text detectors, incorrectly identifying synthetic data as training samples regardless of the data source. This behavior persists across different model architectures and sizes, from open-source models to commercial ones such as GPT-3.5. Even synthetic text generated by different, potentially larger models is classified as training data by the target model. Our findings highlight a serious concern: using synthetic data in membership evaluations may lead to false conclusions about model memorization and data…
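The effect described in the abstract can be illustrated with a minimal sketch, assuming a simple loss-threshold attack (the text above does not say which attacks were evaluated). All loss values below are hypothetical numbers invented for illustration, not results from the paper, and the helper names `membership_scores` and `auc` are assumptions of this sketch. It shows how an evaluation's AUC changes when the non-member set is synthetic text that the target model assigns low loss.

```python
from itertools import product


def membership_scores(losses):
    """Loss-based MIA score: a lower loss yields a higher membership score."""
    return [-loss for loss in losses]


def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    wins = 0.0
    for m, n in product(member_scores, nonmember_scores):
        if m > n:
            wins += 1.0
        elif m == n:
            wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical per-sample losses under the target model (illustrative only).
member_losses = [2.1, 2.4, 2.0, 2.6]               # true training samples
natural_nonmember_losses = [2.9, 3.1, 2.5, 3.4]    # human-written, unseen
synthetic_nonmember_losses = [1.2, 1.5, 1.1, 1.8]  # machine-generated, unseen

auc_natural = auc(
    membership_scores(member_losses),
    membership_scores(natural_nonmember_losses),
)
auc_synthetic = auc(
    membership_scores(member_losses),
    membership_scores(synthetic_nonmember_losses),
)

print(f"AUC vs. natural non-members:   {auc_natural}")
print(f"AUC vs. synthetic non-members: {auc_synthetic}")
```

In this toy setup, an AUC well below 0.5 against the synthetic set means the attack ranks the synthetic non-members as more "member-like" than the true members, which is the failure mode the abstract describes.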
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Layer Normalization · Dense Connections · Cosine Annealing
