Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection

Ali Naseh; Niloofar Mireshghallah

arXiv:2501.11786·cs.CL·January 22, 2025

TL;DR

This paper shows that membership inference attacks (MIAs) on large language models become unreliable when synthetic data serves as the non-member set: the attacks tend to classify synthetic text as training data, which distorts evaluations of model memorization.

Contribution

The paper demonstrates that MIAs act as machine-generated text detectors, which exposes a fundamental flaw in using synthetic data for membership inference evaluations.

Findings

01. MIAs often misclassify synthetic data as training data.

02. Synthetic data can lead to false conclusions about model memorization.

03. The issue persists across various model architectures and sizes.

Abstract

Recent work shows membership inference attacks (MIAs) on large language models (LLMs) produce inconclusive results, partly due to difficulties in creating non-member datasets without temporal shifts. While researchers have turned to synthetic data as an alternative, we show this approach can be fundamentally misleading. Our experiments indicate that MIAs function as machine-generated text detectors, incorrectly identifying synthetic data as training samples regardless of the data source. This behavior persists across different model architectures and sizes, from open-source models to commercial ones such as GPT-3.5. Even synthetic text generated by different, potentially larger models is classified as training data by the target model. Our findings highlight a serious concern: using synthetic data in membership evaluations may lead to false conclusions about model memorization and data…
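To make the failure mode concrete, here is a minimal sketch of a loss-threshold attack, one common form of MIA. The abstract does not say which attacks the authors evaluated, so the attack choice is an assumption, and every loss value below is made up for illustration rather than taken from the paper. The helper names `loss_attack_scores` and `auc` are hypothetical.

```python
def loss_attack_scores(losses):
    """Loss-threshold MIA score: lower model loss means 'more member-like'.

    Negating the loss gives a score where higher = more likely a member.
    """
    return [-loss for loss in losses]


def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win. 0.5 means the attack is at chance level.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Made-up per-sample losses under a hypothetical target model
# (illustrative only, NOT results from the paper).
member_losses = [2.9, 3.1, 3.4, 2.7]               # real training samples
human_nonmember_losses = [3.6, 3.8, 3.3, 4.0]      # human-written, unseen
synthetic_nonmember_losses = [1.8, 2.1, 1.6, 2.3]  # machine-generated, unseen

auc_human = auc(loss_attack_scores(member_losses),
                loss_attack_scores(human_nonmember_losses))
auc_synth = auc(loss_attack_scores(member_losses),
                loss_attack_scores(synthetic_nonmember_losses))

print(f"AUC vs human non-members:     {auc_human}")
print(f"AUC vs synthetic non-members: {auc_synth}")
```

In this toy setup the synthetic non-members receive lower loss than the true members, so the attack ranks them as more member-like than the real training data and the AUC falls below chance. An evaluation built on such a non-member set would then be measuring how machine-like the text is, not membership.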

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Layer Normalization · Dense Connections · Cosine Annealing