The Elicitation Game: Evaluating Capability Elicitation Techniques
Felix Hofst\"atter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, Francis Rhys Ward

TL;DR
This paper evaluates various capability elicitation techniques on language models, introducing a novel circuit-breaking method that enhances robustness, and finds fine-tuning most effective for revealing hidden capabilities, thereby improving AI evaluation trustworthiness.
Contribution
It introduces a new circuit-breaking training method for models and compares elicitation techniques, highlighting fine-tuning as the most reliable approach for capability assessment.
Findings
Prompting effectively elicits capabilities in password-locked and circuit-broken models.
Steering techniques fail to elicit actual capabilities.
Fine-tuning is most effective for revealing hidden capabilities.
Abstract
Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system's capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms -- language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit-breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTechnology Adoption and User Behaviour
MethodsFocus
