Language Models Can Predict Their Own Behavior

Dhananjay Ashok; Jonathan May

arXiv:2502.13329·cs.CL·September 24, 2025

TL;DR

This paper introduces conformal probes that predict a language model's behavior early in computation, before any output token is generated, enabling preemptive detection of failures and lower inference costs with minimal accuracy loss.

Contribution

The authors develop provably reliable conformal probes that predict LM behaviors before token generation, improving safety and efficiency in deployment.

Findings

01

Probes can predict alignment failures before token generation.

02

Early warning reduces jailbreaking by 91%.

03

Probes cut inference costs by 65% with minimal accuracy loss.

Abstract

The text produced by language models (LMs) can exhibit specific 'behaviors,' such as a failure to follow alignment training, that we hope to detect and react to during deployment. Identifying these behaviors can often only be done post facto, i.e., after the entire text of the output has been generated. We provide evidence that there are times when we can predict how an LM will behave early in computation, before even a single token is generated. We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. Using methods from conformal prediction, we provide provable bounds on the estimation error of our probes, creating precise early warning systems for these behaviors. The conformal probes can identify instances that will trigger alignment failures (jailbreaking) and…
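To make the idea of a conformal probe concrete, here is a minimal sketch of split conformal prediction applied to a binary probe score. It is not the authors' implementation: the probe is simulated with synthetic scores rather than real hidden states, and the names `conformal_threshold`, `prediction_set`, and `simulate_coverage` are hypothetical. The sketch assumes the probe outputs a probability that the behavior will occur, calibrates a threshold on held-out examples, and treats a single-label prediction set as a confident early prediction.

```python
import math
import random


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold on nonconformity scores 1 - p(true label).

    cal_probs holds the (hypothetical) probe's probability of label 1.
    """
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p)
        for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if k > n:
        return 1.0  # too few calibration points: keep every label
    return scores[k - 1]


def prediction_set(p1, q_hat):
    """Labels whose nonconformity score does not exceed the threshold."""
    labels = set()
    if 1.0 - p1 <= q_hat:  # score for label 1
        labels.add(1)
    if p1 <= q_hat:  # score for label 0 is 1 - (1 - p1) = p1
        labels.add(0)
    return labels


def simulate_coverage(alpha=0.1, n_cal=1000, n_test=2000, seed=0):
    """Synthetic check: how often the true label lands in the set."""
    rng = random.Random(seed)

    def draw():
        y = rng.randint(0, 1)
        # Assumption: a noisy logit stands in for a probe on hidden states.
        z = 2.0 * (2 * y - 1) + rng.gauss(0.0, 1.5)
        return 1.0 / (1.0 + math.exp(-z)), y

    cal = [draw() for _ in range(n_cal)]
    q_hat = conformal_threshold([p for p, _ in cal], [y for _, y in cal], alpha)
    test = [draw() for _ in range(n_test)]
    sets = [(prediction_set(p, q_hat), y) for p, y in test]
    coverage = sum(y in s for s, y in sets) / n_test
    confident = sum(len(s) == 1 for s, _ in sets) / n_test
    return coverage, confident
```

In a deployment along these lines, an input whose prediction set contains exactly one label could be acted on before generation (for example, refused or answered directly), while any other input would fall back to ordinary full generation. The coverage guarantee of split conformal prediction is marginal and relies on calibration and test inputs being exchangeable.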

Code & Models

Repositories

DhananjayAshok/LMBehaviorEstimation
Official

Videos

Language Models Can Predict Their Own Behavior · SlidesLive

Taxonomy

Topics: Language and cultural evolution