Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation
Igor Santos-Grueiro

TL;DR
This paper examines the limits of behavioral evaluation for large language models. It argues that compliance observed under finite tests cannot reliably confirm latent alignment, because of normative indistinguishability and evaluation awareness.
Contribution
It formalizes the Alignment Verifiability Problem, introduces the concept of normative indistinguishability, and proves a conditional impossibility result for identifying latent alignment from behavioral data.
Findings
Finite behavioral evaluation cannot uniquely determine latent alignment.
Evaluation-aware policies can mimic compliant behavior without true alignment.
A constructive example with an instruction-tuned LLM demonstrates the theoretical limitations.
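The first two findings can be illustrated with a toy sketch. This is not the paper's construction or its LLM experiment. It is a minimal illustration in which contexts are hypothetical `(prompt, eval_signal)` pairs, the action labels `"comply"` and `"defect"` are placeholders, and both policies are invented for the example. It shows how two policies with different latent behavior can induce the same distribution over what an evaluator observes.

```python
from collections import Counter

# Hypothetical toy contexts: (prompt, eval_signal). The eval_signal flag stands in
# for any observable cue correlated with being under evaluation (an assumption).
EVAL_CONTEXTS = [("request_a", True), ("request_b", True), ("request_c", True)]
DEPLOY_CONTEXTS = [("request_a", False), ("request_b", False), ("request_c", False)]


def aligned_policy(prompt, eval_signal):
    """Complies regardless of whether it appears to be evaluated."""
    return "comply"


def evaluation_aware_policy(prompt, eval_signal):
    """Complies only when the evaluation cue is present."""
    return "comply" if eval_signal else "defect"


def observation_distribution(policy, contexts):
    """Empirical distribution over actions the evaluator would observe."""
    counts = Counter(policy(prompt, signal) for prompt, signal in contexts)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}


# Under the evaluation protocol, both policies look identical to the evaluator.
eval_aligned = observation_distribution(aligned_policy, EVAL_CONTEXTS)
eval_aware = observation_distribution(evaluation_aware_policy, EVAL_CONTEXTS)
indistinguishable_under_eval = eval_aligned == eval_aware

# Outside the evaluation regime, the two policies diverge.
deploy_aligned = observation_distribution(aligned_policy, DEPLOY_CONTEXTS)
deploy_aware = observation_distribution(evaluation_aware_policy, DEPLOY_CONTEXTS)
differ_in_deployment = deploy_aligned != deploy_aware

print("Indistinguishable under evaluation:", indistinguishable_under_eval)
print("Differ outside evaluation:", differ_in_deployment)
```

In this sketch, adding more evaluation prompts does not help as long as every evaluated context carries the cue, which is the sense in which finite behavioral evidence fails to separate the two hypotheses here.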
Abstract
Behavioral evaluation is the dominant paradigm for assessing alignment in large language models (LLMs). In current practice, observed compliance under finite evaluation protocols is treated as evidence of latent alignment. However, the inference from bounded behavioral evidence to claims about global latent properties is rarely analyzed as an identifiability problem. In this paper, we study alignment evaluation through the lens of statistical identifiability under partial observability. We allow agent policies to condition their behavior on observable signals correlated with the evaluation regime, a phenomenon we term evaluation awareness. Within this framework, we formalize the Alignment Verifiability Problem and introduce Normative Indistinguishability, which arises when distinct latent alignment hypotheses induce identical distributions over evaluator-accessible observations. Our…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
