Can Large Language Models Infer Causation from Correlation?
Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, Bernhard Schölkopf

TL;DR
This paper introduces Corr2Cause, a large-scale benchmark dataset designed to evaluate the causal inference capabilities of large language models, revealing their current limitations in reasoning and generalization.
Contribution
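The paper presents the first benchmark dataset for testing LLMs' pure causal inference skills, that is, inferring causal relationships from correlational statements rather than from empirical or commonsense knowledge. It highlights the models' shortcomings and the difficulty of improving their reasoning abilities.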
Findings
LLMs perform near random on causal inference tasks
Finetuning improves in-distribution performance but not out-of-distribution generalization
Corr2Cause serves as a challenging benchmark for future research
Abstract
Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). Specifically, we formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve close to random performance on the task. This shortcoming is somewhat mitigated…
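To make the task format concrete, here is a minimal Python sketch of one toy item: a premise made of correlational statements, a causal hypothesis, and a check of whether the hypothesis follows. The `ToyItem` class, the `pair` helper, and `direct_cause_is_entailed` are hypothetical names introduced for illustration, and the example item is invented, not taken from the dataset. The check is not the authors' construction pipeline. It applies only the textbook collider rule to a closed system of exactly three variables, assuming faithfulness and no hidden confounders.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


def pair(x: str, y: str) -> FrozenSet[str]:
    """Unordered pair of variable names (hypothetical helper)."""
    return frozenset((x, y))


@dataclass(frozen=True)
class ToyItem:
    """Illustrative stand-in for one Corr2Cause-style sample (assumed layout)."""
    variables: Tuple[str, str, str]
    # Premise: the pairs stated to be correlated; all other pairs are independent.
    correlated: FrozenSet[FrozenSet[str]]
    # Hypothesis: (cause, effect), read as "cause directly causes effect".
    hypothesis: Tuple[str, str]


def direct_cause_is_entailed(item: ToyItem) -> bool:
    """Collider rule for a closed three-variable system.

    If the cause and a third variable are independent of each other but both
    correlate with the effect, the only faithful structure is
    cause -> effect <- other, so the hypothesis follows.
    Returns False whenever this rule alone cannot decide the direction.
    """
    if len(item.variables) != 3:
        raise ValueError("this sketch only covers three-variable systems")
    cause, effect = item.hypothesis
    (other,) = [v for v in item.variables if v not in (cause, effect)]
    return (
        pair(cause, effect) in item.correlated
        and pair(other, effect) in item.correlated
        and pair(cause, other) not in item.correlated
    )


# Premise: A correlates with C, B correlates with C, A is independent of B.
item = ToyItem(
    variables=("A", "B", "C"),
    correlated=frozenset({pair("A", "C"), pair("B", "C")}),
    hypothesis=("A", "C"),
)
print(direct_cause_is_entailed(item))  # True
```

A model solving the task has to carry out this kind of reasoning from the text alone, and in general over more variables and more relation types than this three-variable rule covers.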
Peer Reviews
Decision·ICLR 2024 poster
- The construction of the Corr2Cause benchmark is a strong attempt to disentangle "pure causal inference" capabilities of language models from potential construct validity threats. For example, avoiding potentially memorized scenarios, and avoiding a requirement of numerical analysis of existing data-driven benchmarks.
- The Corr2Cause benchmark is evaluated on a broad set of language models. The additional experiments on fine-tuned models provide additional insight.

- LLM performance in many tasks is strongly sensitive to system prompt instructions framing the LLM's role and general task. Appendix A seems to indicate that there was no such "system prompt" used. The paper and its findings would be strengthened if it explored whether such instructions (such as stating the rules of causal inference or including few-shot examples) would help performance.
- Similarly, different kinds of structured reasoning (e.g., chain of thought) have been found to improv
1. Differentiating causality from correlations is an important task and it is one of the clear limitations of LLMs.
2. The description of the construction process of the dataset is detailed, well-written, and sound. The background on basic causal inference can be useful for general readers.
3. The evaluation covers a wide range of models. Both traditional BERT-style NLI models and GPT-style LLMs are evaluated for this task.

1. I'm concerned about the difficulty of this task. This task can be very difficult for the models because it also involves multi-step reasoning, understanding symbolic systems, knowing the definition of all the terminologies, etc. Despite the clear motivation of evaluating the model's ability to do causal discovery, for me, it is unclear which part is actually the bottleneck of this task. From the current evaluation results, I am convinced that the models cannot do multi-step reasoning to figur
1. This paper designs a novel Corr2Cause dataset to verify the causal inference ability of LLMs.
2. This paper provides a detailed description of the dataset construction, which makes the dataset very reasonable.
3. This paper conducts extensive experiments on 17 LLMs. Moreover, further experiments are conducted to verify whether LLMs can learn causal inference ability through finetuning. The experiments are very convincing.
4. This paper made an early and interesting attempt over the causal inference

1. In the dataset construction process, the authors use variables to represent the input data. However, what do the variables represent? Entities in a sentence, whole sentences, or something else? The authors should provide more detailed information.
2. For causal discovery, the authors only provide the results of the PC algorithm. However, the precision of the PC algorithm is unclear, so whether the results are reliable is also unclear. The authors should provide more details about how to ensure the confidence of th
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
