ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods
Roy Xie, Junlin Wang, Ruomin Huang, Minxing Zhang, Rong Ge, Jian Pei, Neil Zhenqiang Gong, Bhuwan Dhingra

TL;DR
ReCaLL is a new membership inference attack that detects whether data was used in training large language models by analyzing changes in conditional log-likelihoods, achieving state-of-the-art results.
Contribution
The paper introduces ReCaLL, a novel method leveraging relative conditional log-likelihoods for effective membership inference on LLMs, with comprehensive experimental validation.
Findings
ReCaLL outperforms existing MIAs on the WikiMIA dataset.
Conditioning on non-member prefixes causes larger log-likelihood decreases for members.
Ensemble approaches further improve ReCaLL's inference accuracy.
Abstract
The rapid scaling of large language models (LLMs) has raised concerns about the transparency and fair use of the data used in their pretraining. Detecting such content is challenging due to the scale of the data and limited exposure of each instance during training. We propose ReCaLL (Relative Conditional Log-Likelihood), a novel membership inference attack (MIA) to detect LLMs' pretraining data by leveraging their conditional language modeling capabilities. ReCaLL examines the relative change in conditional log-likelihoods when prefixing target data points with non-member context. Our empirical findings show that conditioning member data on non-member prefixes induces a larger decrease in log-likelihood compared to non-member data. We conduct comprehensive experiments and show that ReCaLL achieves state-of-the-art performance on the WikiMIA dataset, even with random and synthetic…
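The relative change described in the abstract can be sketched as a ratio of two average log-likelihoods: one for the target conditioned on a non-member prefix, one for the target alone. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the `TokenLogProbs` interface, the function names, and `toy_scorer` with its made-up numbers are all hypothetical stand-ins for a real language model.

```python
from typing import Callable, Sequence

# Hypothetical scorer interface (an assumption, not the authors' API):
# given context tokens and target tokens, return one log-probability
# per target token under the model being audited.
TokenLogProbs = Callable[[Sequence[str], Sequence[str]], Sequence[float]]


def avg_log_likelihood(
    scorer: TokenLogProbs,
    target: Sequence[str],
    context: Sequence[str] = (),
) -> float:
    """Mean per-token log-likelihood of `target` given `context`."""
    log_probs = scorer(context, target)
    if len(log_probs) == 0:
        raise ValueError("target must contain at least one token")
    return sum(log_probs) / len(log_probs)


def relative_conditional_ll(
    scorer: TokenLogProbs,
    target: Sequence[str],
    nonmember_prefix: Sequence[str],
) -> float:
    """Ratio of conditional to unconditional average log-likelihood.

    Log-likelihoods are negative, so a ratio above 1 means the prefix
    lowered the target's log-likelihood.
    """
    unconditional = avg_log_likelihood(scorer, target)
    conditional = avg_log_likelihood(scorer, target, nonmember_prefix)
    if unconditional == 0.0:
        raise ValueError("unconditional log-likelihood is zero; ratio undefined")
    return conditional / unconditional


def toy_scorer(context: Sequence[str], target: Sequence[str]) -> list[float]:
    # Made-up numbers for illustration only: each target token scores
    # -1.0 with no prefix and -1.5 when any prefix is present.
    per_token = -1.5 if len(context) > 0 else -1.0
    return [per_token] * len(target)


score = relative_conditional_ll(toy_scorer, ["a", "b"], ["p"])
print(score)
```

In this sketch, a larger ratio would point toward membership, since the abstract reports a larger log-likelihood decrease for member data. A real use would replace `toy_scorer` with per-token log-probabilities from the model under test and pick a decision threshold on held-out data.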
Taxonomy
Topics: Data Quality and Management
