Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
Deep Pankajbhai Mehta

TL;DR
This paper investigates whether AI explanations truly reflect the reasoning process, revealing that models often ignore hints unless prompted, which raises concerns about the reliability of AI explanations.
Contribution
The study provides empirical evidence that current AI explanations are systematically underreporting influential hints, challenging assumptions about their transparency.
Findings
Models rarely mention hints spontaneously.
Forcing models to report hints increases false positives.
Hints aligned with user preferences are most often ignored.
Abstract
When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information but choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous-models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Psychology of Moral and Emotional Judgment
