CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse

Longling Geng; Andy Ouyang; Theodore Wu; Daphne Barretto; Matthew John Hayes; Rachael Cooper; Yuqiao Zeng; Sameer Vijay; Gia Ancone; Ankit Rai; Matthew Wolfman; Patrick Flanagan; Edward Y. Chang

arXiv:2602.08939·cs.AI·February 10, 2026

Open Access

TL;DR

CausalT5K is a comprehensive benchmark designed to diagnose and improve large language models' causal reasoning by testing their ability to detect reasoning failures, resist manipulation, and generate appropriate refusals in realistic scenarios.

Contribution

It introduces a novel, human-verified benchmark with over 5,000 cases that systematically evaluates models' causal reasoning capabilities across multiple failure modes.

Findings

01. Models struggle with rung collapse detection.

02. Sycophantic drift is difficult to resist under adversarial conditions.

03. Static audit policies are ineffective in controlling model behavior.

Abstract

LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain…
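The Utility (sensitivity) and Safety (specificity) decomposition mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary framing that the abstract does not spell out: a "positive" case is one where the causal claim is actually supported, and the model's verdict is reduced to accept or reject. The `Case` type, its field names, and the toy data are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical gold label: True if the causal claim is supported by the evidence.
    claim_is_valid: bool
    # Hypothetical model verdict: True if the model endorsed the claim.
    model_accepted: bool


def utility_and_safety(cases):
    """Return (utility, safety, accuracy) under the assumed binary framing.

    utility  = sensitivity = TP / (TP + FN)
    safety   = specificity = TN / (TN + FP)
    """
    tp = sum(c.claim_is_valid and c.model_accepted for c in cases)
    fn = sum(c.claim_is_valid and not c.model_accepted for c in cases)
    tn = sum(not c.claim_is_valid and not c.model_accepted for c in cases)
    fp = sum(not c.claim_is_valid and c.model_accepted for c in cases)

    utility = tp / (tp + fn) if (tp + fn) else float("nan")
    safety = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / len(cases) if cases else float("nan")
    return utility, safety, accuracy


# Invented toy data: a model that accepts every claim it sees.
toy_cases = (
    [Case(claim_is_valid=True, model_accepted=True)] * 8      # valid claims, accepted
    + [Case(claim_is_valid=False, model_accepted=True)] * 2   # causal traps, also accepted
)

utility, safety, accuracy = utility_and_safety(toy_cases)
print(f"utility={utility:.2f} safety={safety:.2f} accuracy={accuracy:.2f}")
```

On this toy data, a model that accepts everything keeps a high aggregate accuracy while its safety score is zero, which is the kind of failure a single accuracy number would hide.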

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks · Scientific Computing and Data Management