Stress-Testing Capability Elicitation With Password-Locked Models

Ryan Greenblatt; Fabien Roger; Dmitrii Krasheninnikov; David Krueger

arXiv:2405.19550·cs.LG·May 31, 2024

Stress-Testing Capability Elicitation With Password-Locked Models

Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger

PDF

Open Access 1 Repo 1 Video

TL;DR

This paper introduces password-locked models to evaluate the effectiveness of fine-tuning in eliciting hidden capabilities of large language models, revealing that fine-tuning can often unlock capabilities even without explicit prompts.

Contribution

The study presents a novel password-locked model framework to assess capability elicitation and demonstrates fine-tuning's ability to unlock hidden capabilities under various conditions.

Findings

01

Few high-quality demonstrations can fully elicit locked capabilities.

02

Fine-tuning can unlock capabilities across different passwords.

03

Reinforcement learning can sometimes elicit capabilities without demonstrations.

Abstract

To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

FabienRoger/sandbagging
noneOfficial

Videos

Stress-Testing Capability Elicitation With Password-Locked Models· slideslive

Taxonomy

TopicsFault Detection and Control Systems · Industrial Vision Systems and Defect Detection