
TL;DR
This study investigates how LLM explanations align with predictive lexical cues, revealing a pattern in which explanations for correct predictions tend to reference more supporting evidence, while explanations for incorrect predictions tend to reference more contradicting cues.
Contribution
It introduces an empirical analysis of support-contra asymmetry in LLM explanations using external interpretable feature importance signals across multiple datasets.
Findings
Explanations for correct predictions reference more supporting cues.
Explanations for incorrect predictions reference more contradicting cues.
The support-contra asymmetry pattern is consistent across datasets and models.
Abstract
Large Language Models (LLMs) increasingly produce natural language explanations alongside their predictions, yet it remains unclear whether these explanations reference predictive cues present in the input text. In this work, we present an empirical study of how LLM-generated explanations align with predictive lexical evidence from an external model in text classification tasks. To analyze this relationship, we compare explanation content against interpretable feature importance signals extracted from transparent linear classifiers. These reference models allow us to partition predictive lexical cues into supporting and contradicting evidence relative to the predicted label. Across three benchmark datasets (WIKIONTOLOGY, AG NEWS, and IMDB), we observe a consistent empirical pattern that we term support-contra asymmetry. Explanations accompanying correct predictions tend to reference more…
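As a rough illustration of the partitioning step the abstract describes, the sketch below splits the lexical cues in an input into supporting and contradicting sets relative to a predicted label, then counts how many of each an explanation mentions. It is not the authors' implementation: the `WEIGHTS` table, the tokenizer, the sign-based split, and the function names `partition_cues` and `cue_mentions` are all assumptions made for this example, with the weights standing in for coefficients of a transparent linear classifier.

```python
import re

# Hypothetical per-class token weights, standing in for the coefficients of a
# transparent linear classifier. The values are illustrative, not from the paper.
WEIGHTS = {
    "positive": {"great": 1.2, "moving": 0.8, "dull": -1.1, "boring": -1.4},
    "negative": {"great": -1.2, "moving": -0.8, "dull": 1.1, "boring": 1.4},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def partition_cues(input_text, predicted_label, weights=WEIGHTS):
    """Split input tokens into supporting and contradicting cues.

    Assumption: a token supports the predicted label if its weight for that
    label is positive, and contradicts it if the weight is negative.
    """
    support, contra = set(), set()
    for token in set(tokenize(input_text)):
        weight = weights[predicted_label].get(token, 0.0)
        if weight > 0:
            support.add(token)
        elif weight < 0:
            contra.add(token)
    return support, contra


def cue_mentions(explanation, support, contra):
    """Count how many supporting and contradicting cues the explanation mentions."""
    explanation_tokens = set(tokenize(explanation))
    return len(support & explanation_tokens), len(contra & explanation_tokens)


review = "A great and moving film, though the middle is dull."
explanation = "The review calls the film great despite a dull middle."
support, contra = partition_cues(review, "positive")
n_support, n_contra = cue_mentions(explanation, support, contra)
```

Comparing `n_support` and `n_contra` separately for correct and incorrect predictions is one simple way to look for the asymmetry described above; the exact measure used in the paper is not stated in this summary.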
