CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models

Naiming Liu; Richard Baraniuk; Shashank Sonkar

arXiv:2506.17180 · cs.CL · June 23, 2025

TL;DR

CLEAR-3K introduces a dataset to evaluate language models' ability to distinguish true causal explanations from mere semantic relatedness, revealing current models' limitations in causal reasoning despite increasing size.

Contribution

The paper presents a new benchmark dataset, CLEAR-3K, for assessing causal explanatory reasoning in language models, highlighting their tendency to confuse causality with semantic similarity.

Findings

01

Models often confuse semantic similarity with causality.

02

Larger models shift from skepticism to over-permissiveness in causal judgments.

03

Performance plateaus at an MCC of 0.55, even for the largest models.
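
The third finding is stated in terms of the Matthews correlation coefficient (MCC). As a minimal sketch of why MCC suits a task where models drift from rejecting to accepting causal explanations, the Python below computes MCC from a binary confusion matrix and applies it to three made-up prediction patterns. The function names, the toy labels, and the convention of returning 0.0 when the denominator is zero are assumptions for illustration and are not taken from the paper.

```python
import math


def confusion_counts(labels, predictions):
    """Count TP, FP, TN, FN for boolean labels (True = reason explains assertion)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fn = sum(1 for y, p in pairs if y and not p)
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: a degenerate matrix (one class never predicted) scores 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy data: four true causal explanations, four merely related pairs.
labels = [True] * 4 + [False] * 4

always_no = [False] * 8             # overly skeptical: rejects every explanation
always_yes = [True] * 8             # overly permissive: accepts every explanation
mostly_yes = [True] * 7 + [False]   # permissive, with a single rejection

skeptical_score = mcc(*confusion_counts(labels, always_no))
permissive_score = mcc(*confusion_counts(labels, always_yes))
leaning_score = mcc(*confusion_counts(labels, mostly_yes))
leaning_accuracy = sum(y == p for y, p in zip(labels, mostly_yes)) / len(labels)
```

Under these assumptions, both the always-reject and the always-accept patterns score an MCC of 0 even though each reaches 50% accuracy on the balanced toy set, so a shift from one bias to the other does not by itself raise the score.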

Abstract

We introduce CLEAR-3K, a dataset of 3,000 assertion-reasoning questions designed to evaluate whether language models can determine if one statement causally explains another. Each question presents an assertion-reason pair and challenges language models to distinguish between semantic relatedness and genuine causal explanatory relationships. Through comprehensive evaluation of 21 state-of-the-art language models (ranging from 0.5B to 72B parameters), we identify two fundamental findings. First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap instead of inferring actual causal explanatory relationships. Second, as parameter size increases, models tend to shift from being overly skeptical about causal relationships to being excessively permissive in accepting them. Despite this shift, performance measured by the Matthews…

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications