Support-Contra Asymmetry in LLM Explanations

Avinash Patil

arXiv:2510.21884·cs.CL·April 3, 2026


TL;DR

This study examines whether LLM-generated explanations reference the lexical cues that an external linear classifier finds predictive. Explanations for correct predictions tend to reference more supporting cues, while explanations for incorrect predictions tend to reference more contradicting cues.

Contribution

The paper presents an empirical analysis of support-contra asymmetry in LLM explanations, using interpretable feature importance signals from external transparent linear classifiers across three benchmark datasets.

Findings

01

Explanations for correct predictions reference more supporting cues.

02

Explanations for incorrect predictions reference more contradicting cues.

03

The support-contra asymmetry pattern is consistent across datasets and models.

Abstract

Large Language Models (LLMs) increasingly produce natural language explanations alongside their predictions, yet it remains unclear whether these explanations reference predictive cues present in the input text. In this work, we present an empirical study of how LLM-generated explanations align with predictive lexical evidence from an external model in text classification tasks. To analyze this relationship, we compare explanation content against interpretable feature importance signals extracted from transparent linear classifiers. These reference models allow us to partition predictive lexical cues into supporting and contradicting evidence relative to the predicted label. Across three benchmark datasets (WIKIONTOLOGY, AG NEWS, and IMDB), we observe a consistent empirical pattern that we term support-contra asymmetry. Explanations accompanying correct predictions tend to reference more…
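
To make the partitioning step concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a cue supports the predicted label when its linear-classifier weight for that label is positive, and contradicts it when the weight is negative. The weight table, the tokenizer, and the function names (`partition_cues`, `count_mentions`) are all hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical per-class token weights, standing in for the coefficients of a
# transparent linear classifier (for example, logistic regression over a
# bag-of-words representation). These values are made up for illustration.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "positive": {"great": 1.2, "loved": 0.8, "boring": -0.9, "dull": -0.7},
    "negative": {"great": -1.2, "loved": -0.8, "boring": 0.9, "dull": 0.7},
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def partition_cues(
    text: str, predicted_label: str, weights: Dict[str, Dict[str, float]]
) -> Tuple[Set[str], Set[str]]:
    """Split input tokens into supporting and contradicting cues.

    Assumption: the sign of a token's weight for the predicted label decides
    which side it falls on. Tokens with no weight are ignored.
    """
    label_weights = weights[predicted_label]
    support: Set[str] = set()
    contra: Set[str] = set()
    for token in set(tokenize(text)):
        weight = label_weights.get(token, 0.0)
        if weight > 0:
            support.add(token)
        elif weight < 0:
            contra.add(token)
    return support, contra


def count_mentions(
    explanation: str, support: Set[str], contra: Set[str]
) -> Tuple[int, int]:
    """Count how many supporting and contradicting cues the explanation names."""
    tokens = set(tokenize(explanation))
    return len(tokens & support), len(tokens & contra)


# Toy example: an input, a predicted label, and an LLM-style explanation.
text = "I loved the cast, but the plot was boring and dull."
explanation = "The review says the viewer loved the cast despite a boring plot."

support, contra = partition_cues(text, "positive", WEIGHTS)
n_support, n_contra = count_mentions(explanation, support, contra)
print(sorted(support), sorted(contra), n_support, n_contra)
```

Under these assumptions, comparing the supporting and contradicting counts separately for correct and incorrect predictions is one way to look for the asymmetry described above. Exact token matching is a simplification; explanations that paraphrase a cue would not be counted.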
