Blackbox Model Provenance via Palimpsestic Membership Inference
Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, Christopher Potts, Percy Liang

TL;DR
This paper introduces a statistical method to determine whether a blackbox language model or generated text originates from a specific training run, leveraging palimpsestic memorization and correlation testing.
Contribution
It formulates provenance verification as an independence test and shows that correlation with the order of the training data can reveal when a model is being used.
Findings
High statistical significance in the query setting, with p-values around 1e-8
Reliable detection of Bob's text with as little as a few hundred tokens
Effective distinction between models trained on original vs. reshuffled data
Abstract
Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable…
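The query-setting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes Alice already has one log-likelihood per training example from Bob's model (simulated here with synthetic numbers), and it swaps in a one-sided permutation test on the Spearman rank correlation as the independence test. The names `spearman`, `permutation_p_value`, `derived`, and `independent` are hypothetical and chosen only for illustration.

```python
import random


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(log_likelihoods, train_positions,
                        num_permutations=2000, seed=0):
    """One-sided permutation test of independence.

    Under the null hypothesis the model is independent of the shuffled
    training order, so every reordering of the positions is equally likely.
    The alternative is that later examples receive higher likelihood.
    """
    rng = random.Random(seed)
    observed = spearman(log_likelihoods, train_positions)
    shuffled = list(train_positions)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)
        if spearman(log_likelihoods, shuffled) >= observed:
            hits += 1
    # The +1 terms keep the p-value valid (never exactly zero).
    return observed, (hits + 1) / (num_permutations + 1)


# Synthetic stand-ins for real model scores (assumption, not paper data).
rng = random.Random(1)
positions = list(range(200))  # position of each example in Alice's run
derived = [0.01 * t + rng.gauss(0, 0.5) for t in positions]  # later = likelier
independent = [rng.gauss(0, 0.5) for _ in positions]  # no relation to order

rho_derived, p_derived = permutation_p_value(derived, positions)
rho_indep, p_indep = permutation_p_value(independent, positions)
print("derived model:    ", rho_derived, p_derived)
print("independent model:", rho_indep, p_indep)
```

With a permutation test the smallest reachable p-value is 1 / (num_permutations + 1), so reporting values as small as those in the Findings would need either many more permutations or an analytic approximation to the null distribution of the correlation.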
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Scientific Computing and Data Management
