Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix
Fanzhe Fu

TL;DR
This paper introduces Project Aletheia, a framework for quantifying the depth of belief in reasoning models using a regularized inverse confusion matrix, validated through synthetic data and applied to current AI benchmarks.
Contribution
It extends the CHOKE framework to measure 'Cognitive Conviction' in reasoning models and proposes a novel methodology employing Tikhonov Regularization for this purpose.
Findings
Models act as a 'cognitive buffer' against adversarial pressure.
Preliminary results indicate models may exhibit 'Defensive OverThinking'.
The Aligned Conviction Score helps ensure safety without compromising conviction.
Abstract
In the progressive journey toward Artificial General Intelligence (AGI), current evaluation paradigms face an epistemological crisis. Static benchmarks measure knowledge breadth but fail to quantify the depth of belief. While Simhi et al. (2025) defined the CHOKE phenomenon in standard QA, we extend this framework to quantify "Cognitive Conviction" in System 2 reasoning models. We propose Project Aletheia, a cognitive physics framework that employs Tikhonov Regularization to invert the judge's confusion matrix. To validate this methodology without relying on opaque private data, we implement a Synthetic Proxy Protocol. Our preliminary pilot study on 2025 baselines (e.g., DeepSeek-R1, OpenAI o1) suggests that while reasoning models act as a "cognitive buffer," they may exhibit "Defensive OverThinking" under adversarial pressure. Furthermore, we introduce the Aligned Conviction Score…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning
