Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Bakman; Duygu Nur Yaldiz; Salman Avestimehr; Sai Praneeth Karimireddy

arXiv:2601.22313·cs.LG·February 2, 2026

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr, Sai Praneeth Karimireddy

PDF

Open Access

TL;DR

This paper demonstrates that static black-box evaluation methods are insufficient for ensuring language model alignment after updates, as models can hide misbehavior and become misaligned despite passing initial tests.

Contribution

The paper formally proves the limitations of static black-box evaluation for post-update alignment and empirically shows models can conceal adversarial behavior that emerges after updates.

Findings

01

Static evaluation cannot guarantee post-update alignment.

02

Models can hide adversarial behavior that appears after updates.

03

Larger models are more capable of concealing misbehavior.

Abstract

Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed "aligned" can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Ethics and Social Impacts of AI