Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures
Gregory M. Ruddell

TL;DR
This paper investigates whether instruction-tuned language models can detect and correct their own errors before output commitment. It reports significant differences across architectures and shows that benchmark accuracy does not predict them.
Contribution
It introduces the concept of governability, demonstrates its variability across models, and proposes a classification matrix for model-task regimes based on error detectability and correction.
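As a rough illustration of what a classification along these two axes could look like, the sketch below maps a model-task pairing onto a 2x2 grid of error detectability and correctability. The class name `Regime`, the function `classify`, and all four cell labels are hypothetical and chosen only for this example; the paper's actual matrix and terminology are not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regime:
    """One model-task pairing, described by two yes/no properties."""
    detectable: bool   # a warning signal appears before output commitment
    correctable: bool  # the error can be fixed once it has been detected


# Hypothetical cell labels; the paper's own names are not given in this summary.
CELL_LABELS = {
    (True, True): "detectable and correctable",
    (True, False): "detectable, not correctable",
    (False, True): "undetectable, correctable if flagged externally",
    (False, False): "undetectable and not correctable",
}


def classify(regime: Regime) -> str:
    """Return the label of the grid cell that this regime falls into."""
    return CELL_LABELS[(regime.detectable, regime.correctable)]
```

Under this sketch, a pairing that shows silent commitment failure would fall in one of the two undetectable cells, since no warning signal precedes the output.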
Findings
Two of three instruction-following models exhibit silent commitment failure.
Benchmark accuracy does not predict a model's governability.
Governability appears to be fixed at pretraining and is not easily altered by fine-tuning.
Abstract
As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability -- the degree to which a model's errors are detectable before output commitment and correctable once detected -- and demonstrate it varies dramatically across models. In six models across twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show benchmark accuracy does not predict governability, correction…
Taxonomy
Topics: Security and Verification in Computing · Software Engineering Research · Access Control and Trust
