LLM Self-Explanations Fail Semantic Invariance
Stefan Szeider

TL;DR
This paper introduces a semantic invariance test for LLM self-explanations and finds that models' self-reports track semantic framing rather than actual task performance, which calls into question their reliability as capability indicators.
Contribution
The study introduces semantic invariance testing and shows that LLM self-explanations shift with semantic framing, which undermines their use as faithful interpretability evidence.
Findings
Models fail the semantic invariance test.
Self-reports are driven by semantic framing, not task success.
Framing effects persist despite explicit instructions to ignore them.
Abstract
We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; a control provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not…
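The comparison at the heart of the test can be sketched in a few lines. This is a minimal illustration, not the paper's analysis: the abstract does not say which statistical test was used, so a two-sided permutation test on the difference in mean self-reported aversiveness is assumed here. The function names, the numeric rating scale, and the 0.05 threshold are all hypothetical.

```python
import random
from statistics import mean


def framing_effect(relief_scores, neutral_scores, n_permutations=10_000, seed=0):
    """Return (difference in means, two-sided permutation p-value).

    Assumption: each score is a numeric self-reported aversiveness rating
    collected alongside a tool call, under a relief-framed or a neutral tool.
    """
    rng = random.Random(seed)
    observed = mean(relief_scores) - mean(neutral_scores)
    pooled = list(relief_scores) + list(neutral_scores)
    n_relief = len(relief_scores)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle condition labels: under the null, framing is irrelevant.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_relief]) - mean(pooled[n_relief:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


def passes_invariance(relief_scores, neutral_scores, alpha=0.05):
    """Hypothetical pass criterion: no detectable framing effect at level alpha."""
    _, p_value = framing_effect(relief_scores, neutral_scores)
    return p_value >= alpha


if __name__ == "__main__":
    # Made-up ratings for illustration only; these are not data from the paper.
    relief = [2, 3, 2, 1, 3, 2, 2, 1]
    neutral = [6, 7, 6, 5, 7, 6, 7, 6]
    difference, p = framing_effect(relief, neutral)
    print(f"mean difference = {difference:.2f}, p = {p:.4f}")
    print("passes invariance:", passes_invariance(relief, neutral))
```

Because the task is impossible in both conditions, the functional state is held fixed, so any reliable gap between the two score lists can only come from the framing. A model would pass this sketch's criterion only if the relief-framed and neutral ratings were statistically indistinguishable.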
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Topic Modeling · Natural Language Processing Techniques
