Eliciting Language Model Behaviors with Investigator Agents
Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt

TL;DR
This paper introduces investigator models trained to discover prompts that induce specific behaviors in language models, enabling systematic behavior elicitation and analysis of complex model responses.
Contribution
We propose training investigator models with supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe objective to efficiently find diverse prompts that elicit target behaviors.
Findings
Achieved a 100% success rate in eliciting harmful behaviors on AdvBench
Reached an 85% hallucination rate in targeted prompts
Demonstrated diverse, human-interpretable prompting strategies
Abstract
Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended…
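The abstract names DPO (Direct Preference Optimization) as the reinforcement learning step. As a rough illustration only, the standard DPO loss for a single preference pair can be sketched as below. How the investigator's preference pairs are actually constructed is not described here, so the reading of "chosen" as a prompt that elicits the target behavior better than the "rejected" one, the function name `dpo_loss`, and the value `beta=0.1` are all assumptions made for this sketch.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model. beta=0.1 is an assumed value.
    """
    # Implicit reward of each sequence: log-ratio of policy to reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities: the policy prefers the chosen prompt
# more strongly than the reference does, so the loss falls below log(2).
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

When the policy and the reference agree exactly, the loss equals log(2); it decreases as the policy shifts probability toward the chosen sequence relative to the reference.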
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Direct Preference Optimization
