Eliciting Language Model Behaviors with Investigator Agents

Xiang Lisa Li; Neil Chowdhury; Daniel D. Johnson; Tatsunori Hashimoto; Percy Liang; Sarah Schwettmann; Jacob Steinhardt

arXiv:2502.01236 · cs.LG · February 4, 2025

TL;DR

This paper introduces investigator models trained to discover prompts that induce specific behaviors in language models, enabling systematic behavior elicitation and analysis of complex model responses.

Contribution

We propose a novel approach using investigator models with supervised fine-tuning, reinforcement learning, and a Frank-Wolfe objective to efficiently find diverse prompts for target behaviors.

Findings

01 Achieved 100% success in eliciting harmful behaviors on AdvBench

02 Reached 85% hallucination rate in targeted prompts

03 Demonstrated diverse, human-interpretable prompting strategies

Abstract

Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended…
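
The abstract names DPO as the reinforcement learning step. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair. It assumes the "chosen" prompt is one that elicits the target behavior better than the "rejected" prompt; how the paper actually builds its pairs is not stated in this summary, and the function names and the default `beta` are hypothetical.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full sequence
    (here, a candidate prompt) under the policy being trained or under
    the frozen reference model. beta=0.1 is an illustrative default,
    not a value taken from the paper.
    """
    # Implicit reward of each sequence: log-ratio of policy to reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # The policy already prefers the chosen prompt, so the loss is below log(2).
    print(dpo_loss(-5.0, -9.0, -7.0, -7.0))
```

When the policy and reference agree on both sequences, the loss equals log 2. It falls as the policy shifts probability toward the chosen sequence relative to the reference.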

Videos

Eliciting Language Model Behaviors with Investigator Agents · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Direct Preference Optimization