Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger

TL;DR
This paper introduces password-locked models to evaluate the effectiveness of fine-tuning in eliciting hidden capabilities of large language models, revealing that fine-tuning can often unlock capabilities even without explicit prompts.
Contribution
The study presents a novel password-locked model framework to assess capability elicitation and demonstrates fine-tuning's ability to unlock hidden capabilities under various conditions.
Findings
Few high-quality demonstrations can fully elicit locked capabilities.
Fine-tuning can unlock capabilities across different passwords.
Reinforcement learning can sometimes elicit capabilities without demonstrations.
Abstract
To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsFault Detection and Control Systems · Industrial Vision Systems and Defect Detection
