Language Models Can Predict Their Own Behavior
Dhananjay Ashok, Jonathan May

TL;DR
This paper introduces conformal probes that predict language model behaviors early in computation, enabling preemptive detection of failures and reducing inference costs without sacrificing accuracy.
Contribution
The authors develop provably reliable conformal probes that predict LM behaviors before token generation, improving safety and efficiency in deployment.
Findings
Probes can predict alignment failures before token generation.
Early warning reduces jailbreaking by 91%.
Probes cut inference costs by 65% with minimal accuracy loss.
Abstract
The text produced by language models (LMs) can exhibit specific 'behaviors,' such as a failure to follow alignment training, that we hope to detect and react to during deployment. Identifying these behaviors can often only be done post facto, i.e., after the entire text of the output has been generated. We provide evidence that there are times when we can predict how an LM will behave early in computation, before even a single token is generated. We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. Using methods from conformal prediction, we provide provable bounds on the estimation error of our probes, creating precise early warning systems for these behaviors. The conformal probes can identify instances that will trigger alignment failures (jailbreaking) and…
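To make the idea of a conformal probe concrete, here is a minimal sketch of generic split conformal prediction applied to a binary probe. It is not the authors' implementation, and the paper may use a different conformal variant. The probe probabilities are synthetic stand-ins for the output of a classifier trained on input-token representations, and all names (`conformal_threshold`, `prediction_set`, `sample`) are hypothetical.

```python
import math
import random


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal quantile of the scores 1 - p(true label).

    cal_probs are probe probabilities for label 1 on a held-out
    calibration set; alpha is the target miscoverage rate.
    """
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p)
        for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[k - 1]


def prediction_set(prob, qhat):
    """Return every label whose nonconformity score is at most qhat."""
    labels = set()
    if 1.0 - prob <= qhat:  # score if the true label were 1
        labels.add(1)
    if prob <= qhat:        # score if the true label were 0
        labels.add(0)
    return labels


def sample(n, rng):
    """Synthetic (probability, label) pairs standing in for a real probe."""
    probs, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        z = (1.5 if y == 1 else -1.5) + rng.gauss(0.0, 1.0)
        probs.append(1.0 / (1.0 + math.exp(-z)))
        labels.append(y)
    return probs, labels


rng = random.Random(0)
alpha = 0.1
cal_probs, cal_labels = sample(1000, rng)
test_probs, test_labels = sample(2000, rng)

qhat = conformal_threshold(cal_probs, cal_labels, alpha)
sets = [prediction_set(p, qhat) for p in test_probs]

# Fraction of test points whose true label is inside the prediction set.
coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
# Fraction of test points where the probe commits to a single label.
confident = sum(len(s) == 1 for s in sets) / len(sets)
print(f"coverage={coverage:.3f}, single-label sets={confident:.3f}")
```

Under the usual exchangeability assumption, the true label falls inside the prediction set with probability at least 1 - alpha on average. In an early warning setting, one plausible use is to act only on single-label sets and fall back to full generation otherwise.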
Taxonomy
Topics: Language and cultural evolution
