CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching

Yuzhe Wang; Yaochen Zhu; Jundong Li

arXiv:2602.20094·cs.AI·February 24, 2026

Open Access

TL;DR

This paper introduces CausalFlip, a benchmark for evaluating and improving the causal reasoning of large language models beyond semantic pattern matching, built on adversarial question pairs and a noisy-prefix evaluation.

Contribution

The paper proposes CausalFlip, a novel causal reasoning benchmark with adversarial question pairs and noisy-prefix evaluation, and assesses different training paradigms to enhance causal reasoning in LLMs.

Findings

01. Explicit Chain-of-Thought can be misled by semantic correlations.
02. Internalized causal reasoning improves causal grounding.
03. Models trained with internalized reasoning outperform answer-only and explicit CoT methods.

Abstract

As large language models (LLMs) are increasingly deployed in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability in LLMs, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the underlying causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. Based on this, for each event triple, we construct pairs…
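
The three relation types named in the abstract can be made concrete with a small sketch. It uses the textbook definitions of chain, confounder, and collider over a triple of events and answers "does the first event cause the third?" by looking for a directed path. The function names, the placeholder events, and the choice of the second event as the middle node are assumptions for illustration only; this is not the benchmark's actual question-construction procedure.

```python
from collections import deque

# Hypothetical helpers for illustration; names are not taken from the paper.
def build_edges(x, z, y, relation):
    """Return directed edges for one textbook structure over the triple (x, z, y)."""
    if relation == "chain":        # x -> z -> y
        return [(x, z), (z, y)]
    if relation == "confounder":   # x <- z -> y
        return [(z, x), (z, y)]
    if relation == "collider":     # x -> z <- y
        return [(x, z), (y, z)]
    raise ValueError(f"unknown relation: {relation}")

def causes(edges, source, target):
    """True if a directed path leads from source to target (breadth-first search)."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    queue, seen = deque([source]), {source}
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Placeholder events; the same triple yields different answers per structure.
triple = ("A", "B", "C")
answers = {
    relation: causes(build_edges(*triple, relation), triple[0], triple[2])
    for relation in ("chain", "confounder", "collider")
}
```

Under these definitions only the chain makes the first event a cause of the third, even though the events themselves are identical in all three cases, which is why surface wording alone cannot settle the judgment.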

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)