CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models
Naiming Liu, Richard Baraniuk, Shashank Sonkar

TL;DR
CLEAR-3K introduces a dataset to evaluate language models' ability to distinguish true causal explanations from mere semantic relatedness, revealing current models' limitations in causal reasoning despite increasing size.
Contribution
The paper presents a new benchmark dataset, CLEAR-3K, for assessing causal explanatory reasoning in language models, highlighting their tendency to confuse causality with semantic similarity.
Findings
Models often confuse semantic similarity with causality.
Larger models shift from skepticism to over-permissiveness in causal judgments.
Performance plateaus at an MCC of 0.55 even for the largest models.
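The findings above are reported using MCC, the Matthews correlation coefficient. A minimal sketch of how this metric is computed from a binary confusion matrix may help show why both an overly skeptical and an overly permissive model score poorly. The function name `mcc` and all counts below are hypothetical illustrations, not figures or code from the paper.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here), e.g. when a model gives the same answer every time.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Hypothetical counts on 100 causal and 100 non-causal pairs.
# A skeptical model rejects most pairs, so it misses many true explanations.
skeptical = mcc(tp=20, tn=95, fp=5, fn=80)
# A permissive model accepts most pairs, so it accepts many non-explanations.
permissive = mcc(tp=95, tn=20, fp=80, fn=5)
# A model that is right on every pair.
perfect = mcc(tp=100, tn=100, fp=0, fn=0)
# A model that answers "yes" to everything.
always_yes = mcc(tp=100, tn=0, fp=100, fn=0)

print(f"skeptical={skeptical:.3f} permissive={permissive:.3f}")
print(f"perfect={perfect:.3f} always_yes={always_yes:.3f}")
```

Because MCC uses all four cells of the confusion matrix, the two mirrored error profiles in this sketch receive the same low score, which is why a shift from skepticism to permissiveness need not improve the metric.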
Abstract
We introduce CLEAR-3K, a dataset of 3,000 assertion-reasoning questions designed to evaluate whether language models can determine if one statement causally explains another. Each question presents an assertion-reason pair and challenges language models to distinguish between semantic relatedness and genuine causal explanatory relationships. Through comprehensive evaluation of 21 state-of-the-art language models (ranging from 0.5B to 72B parameters), we identify two fundamental findings. First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap instead of inferring actual causal explanatory relationships. Second, as parameter size increases, models tend to shift from being overly skeptical about causal relationships to being excessively permissive in accepting them. Despite this shift, performance measured by the Matthews…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
