NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise
Zhi Xu, Yun Fu

TL;DR
NoisyCausal is a benchmark designed to evaluate and improve causal reasoning in language models under structured noise by integrating explicit causal structures with LLM prompting.
Contribution
The paper introduces NoisyCausal, a novel benchmark with a modular framework that combines causal graphs and language models to enhance reasoning robustness.
Findings
The proposed modular reasoning framework, which combines LLMs with explicit causal structure, outperforms standard prompting baselines.
It generalizes well to external benchmarks such as CLadder.
Combining causal structures with LLM prompting improves interpretability.
Abstract
Causal reasoning in natural language requires identifying relevant variables, understanding their interactions, and reasoning about effects and interventions, often under noisy or ambiguous conditions. While large language models (LLMs) exhibit strong general reasoning abilities, they struggle to disentangle correlation from causation, particularly when observations are partially incorrect or irrelevant information is present. In this work, we introduce NoisyCausal, a new benchmark designed to evaluate causal reasoning under structured noise. Each instance is generated from a ground-truth causal graph and contextualized with a natural language scenario by injecting controllable forms of noise, such as irrelevant distractors, value perturbations, confounding, and partial observability. Moreover, we propose a modular reasoning framework that combines LLMs with explicit causal structure to…
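To make the instance-generation idea in the abstract concrete, here is a minimal sketch of how a ground-truth causal graph might be sampled and then corrupted with controllable noise. It is not the authors' generator: the toy graph (`rain`, `wet_road`, `accident`), the distractor names, the 0.9 copy probability, and every function name are hypothetical choices made for illustration. It covers three of the four noise types named above (irrelevant distractors, value perturbations, partial observability) and leaves confounding out.

```python
import random

# Hypothetical toy causal graph: rain -> wet_road -> accident (all binary).
# Each key maps a variable to its list of parents.
GRAPH = {"rain": [], "wet_road": ["rain"], "accident": ["wet_road"]}
ORDER = ["rain", "wet_road", "accident"]  # topological order of GRAPH


def sample_clean(rng):
    """Sample one noise-free assignment in topological order."""
    values = {}
    for var in ORDER:
        parents = GRAPH[var]
        if not parents:
            values[var] = rng.random() < 0.5
        else:
            # Assumed mechanism: the child copies its parent with probability 0.9.
            parent_on = any(values[p] for p in parents)
            values[var] = parent_on if rng.random() < 0.9 else not parent_on
    return values


def add_distractors(obs, rng, names=("radio_on", "car_color_red")):
    """Irrelevant distractors: add variables that are not in the causal graph."""
    out = dict(obs)
    for name in names:
        out[name] = rng.random() < 0.5
    return out


def perturb_values(obs, rng, flip_prob):
    """Value perturbation: flip each reported value with probability flip_prob."""
    return {k: (not v if rng.random() < flip_prob else v) for k, v in obs.items()}


def hide_variables(obs, hidden):
    """Partial observability: drop the listed variables from the observation."""
    return {k: v for k, v in obs.items() if k not in hidden}


def make_instance(seed, flip_prob=0.2, hidden=("wet_road",)):
    """Build one instance: the graph, the clean values, and the noisy observation."""
    rng = random.Random(seed)
    clean = sample_clean(rng)
    noisy = add_distractors(clean, rng)
    noisy = perturb_values(noisy, rng, flip_prob)
    noisy = hide_variables(noisy, hidden)
    return {"graph": GRAPH, "clean": clean, "observed": noisy}
```

Calling `make_instance(seed)` with different `flip_prob` and `hidden` settings gives a simple way to vary noise strength while keeping the clean values available for scoring. Turning the `observed` dictionary into a natural language scenario is a separate step that this sketch does not attempt.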
