NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise

Zhi Xu; Yun Fu

arXiv:2605.04313 · cs.CL · May 7, 2026

TL;DR

NoisyCausal is a benchmark for evaluating causal reasoning in language models under structured noise. An accompanying modular framework pairs explicit causal structure with LLM prompting to make that reasoning more robust.

Contribution

The paper introduces NoisyCausal, a novel benchmark with a modular framework that combines causal graphs and language models to enhance reasoning robustness.

Findings

1. The proposed modular framework outperforms standard prompting baselines.
2. It generalizes well to external benchmarks such as CLadder.
3. Combining causal structures with LLM prompting improves interpretability.

Abstract

Causal reasoning in natural language requires identifying relevant variables, understanding their interactions, and reasoning about effects and interventions, often under noisy or ambiguous conditions. While large language models (LLMs) exhibit strong general reasoning abilities, they struggle to disentangle correlation from causation, particularly when observations are partially incorrect or irrelevant information is present. In this work, we introduce NoisyCausal, a new benchmark designed to evaluate causal reasoning under structured noise. Each instance is generated from a ground-truth causal graph and contextualized with a natural language scenario by injecting controllable forms of noise, such as irrelevant distractors, value perturbations, confounding, and partial observability. Moreover, we propose a modular reasoning framework that combines LLMs with explicit causal structure to…
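
The generation process described in the abstract (sample from a ground-truth graph, then inject controllable noise) can be illustrated with a small sketch. This is not the authors' generator: the graph, the variable names, the deterministic mechanism, and every function name below are hypothetical, and confounding is left out to keep the example short. It only shows how distractors, value perturbations, and partial observability might be layered on top of a clean sample.

```python
import random

# Hypothetical ground-truth graph: each variable maps to its direct causes.
PARENTS = {"rain": [], "wet_road": ["rain"], "accident": ["wet_road"]}


def sample_clean(rng):
    """Sample one binary assignment; PARENTS is listed in topological order."""
    values = {}
    for var, parents in PARENTS.items():
        if parents:
            # Toy mechanism (an assumption): an effect is true iff all its causes are.
            values[var] = all(values[p] for p in parents)
        else:
            values[var] = rng.random() < 0.5
    return values


def add_distractors(obs, names, rng):
    """Irrelevant distractors: add variables that are not in the graph."""
    out = dict(obs)
    for name in names:
        out[name] = rng.random() < 0.5
    return out


def perturb_values(obs, rate, rng):
    """Value perturbation: flip each observed value with probability `rate`."""
    return {k: (not v) if rng.random() < rate else v for k, v in obs.items()}


def hide_variables(obs, hidden):
    """Partial observability: drop the listed variables from the observation."""
    return {k: v for k, v in obs.items() if k not in hidden}


def make_instance(seed, distractors=("stock_price",), flip_rate=0.2,
                  hidden=("wet_road",)):
    """Build one noisy instance; the clean sample is kept as ground truth."""
    rng = random.Random(seed)
    clean = sample_clean(rng)
    observed = add_distractors(clean, distractors, rng)
    observed = perturb_values(observed, flip_rate, rng)
    observed = hide_variables(observed, hidden)
    return {"clean": clean, "observed": observed}
```

Calling make_instance(seed) with different flip_rate, distractors, and hidden arguments varies each noise type independently, which is the sense in which the noise is controllable in this sketch. Turning the observed dictionary into a natural language scenario is a separate step that is not shown here.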
