Extracting Prompts by Inverting LLM Outputs

Collin Zhang; John X. Morris; Vitaly Shmatikov

arXiv:2405.15012 · cs.CL · October 10, 2024

TL;DR

This paper introduces output2prompt, a black-box method that extracts prompts from a language model's outputs alone, without access to model internals. It uses a sparse encoding to reduce memory use and transfers zero-shot across different LLMs.

Contribution

The paper presents a prompt extraction technique that operates solely on the outputs of normal user queries, employs sparse encoding for memory efficiency, and transfers zero-shot across different LLMs.

Findings

01 Effective prompt extraction across multiple LLMs

02 Zero-shot transferability demonstrated

03 Memory-efficient sparse encoding technique

Abstract

We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt only needs outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.
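
The abstract does not say how the sparse encoding works. One plausible reading, assumed here for illustration only, is that each sampled output is encoded on its own, so attention is block-diagonal rather than dense over the concatenation of all outputs. Under that assumption, a minimal Python sketch of the attention-matrix size follows; the function names are hypothetical and are not taken from the output2prompt repository.

```python
from typing import List


def dense_attention_entries(num_outputs: int, tokens_per_output: int) -> int:
    """Attention scores when all outputs are concatenated and attend to each other."""
    total_tokens = num_outputs * tokens_per_output
    return total_tokens * total_tokens


def sparse_attention_entries(num_outputs: int, tokens_per_output: int) -> int:
    """Attention scores when each output attends only to its own tokens (assumed scheme)."""
    return num_outputs * tokens_per_output * tokens_per_output


def block_diagonal_mask(num_outputs: int, tokens_per_output: int) -> List[List[bool]]:
    """Boolean mask: True where token i may attend to token j (same output only)."""
    total_tokens = num_outputs * tokens_per_output
    return [
        [i // tokens_per_output == j // tokens_per_output for j in range(total_tokens)]
        for i in range(total_tokens)
    ]


if __name__ == "__main__":
    n, length = 64, 32  # illustrative values, not figures from the paper
    dense = dense_attention_entries(n, length)
    sparse = sparse_attention_entries(n, length)
    print(f"dense: {dense}, sparse: {sparse}, ratio: {dense // sparse}")
```

Under this assumption the dense count exceeds the block-diagonal count by a factor equal to the number of outputs, so the saving grows as more outputs are collected per prompt.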

Code & Models

Repositories

collinzrj/output2prompt (PyTorch, official)

Videos

Extracting Prompts by Inverting LLM Outputs · underline

Taxonomy

Topics: Natural Language Processing Techniques