Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs
Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, Santu Rana

TL;DR
This paper presents a black-box prompt optimization method using an attacker LLM to reveal higher levels of memorization in victim models, surpassing traditional prompting approaches and exposing training data leakage.
Contribution
Introduces an iterative rejection-sampling prompt optimization technique to uncover memorization in LLMs, highlighting the effectiveness of instruction-based prompts and automated attack avenues.
Findings
Instruction-tuned models can leak training data as much as base models.
Contexts beyond training data can cause data leakage.
Using other LLMs' instructions enables automated memorization attacks.
Abstract
In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent, compared to what is revealed by prompting the target model with the training data directly, which is the dominant approach of quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data to avoid presenting the solution directly to the model, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim to spit out training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that (1) instruction-tuned models can…
Taxonomy
Topics: Artificial Intelligence in Law
