Causal Evaluation of Language Models
Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu

TL;DR
This paper introduces CaLM, a comprehensive benchmark for evaluating the causal reasoning capabilities of language models, including a large dataset, evaluation framework, and analysis platform to guide future research.
Contribution
The paper presents what the authors believe to be the first systematic framework and dataset for assessing causal reasoning in language models, together with extensive evaluations and a community platform.
Findings
28 language models evaluated on 92 causal targets
50 empirical findings across 9 dimensions
CaLM platform supports ongoing research and updates
Abstract
Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. First, we propose the CaLM framework, which establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results). This taxonomy defines a broad evaluation design space while systematically selecting criteria and priorities. Second, we compose the…
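The four-module taxonomy described above (causal target, adaptation, metric, error) can be pictured as one evaluation setting with four slots. The sketch below is only an illustration under stated assumptions: `EvalSetting`, `zero_shot`, `accuracy`, `classify_error`, `run`, the toy model, and the two example questions are all hypothetical names and data invented here, not part of CaLM's code or dataset.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalSetting:
    """One point in the evaluation design space (hypothetical structure)."""
    causal_target: str                                # what to evaluate
    adaptation: Callable[[str], str]                  # how to obtain the results
    metric: Callable[[List[str], List[str]], float]   # how to measure the results
    error: Callable[[str], str]                       # how to analyze the bad results


def zero_shot(question: str) -> str:
    # Assumed adaptation: a plain zero-shot prompt template.
    return f"Question: {question}\nAnswer (yes or no):"


def accuracy(preds: List[str], golds: List[str]) -> float:
    # Assumed metric: fraction of exact matches.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def classify_error(response: str) -> str:
    # Assumed error buckets, chosen for illustration only.
    if not response:
        return "empty response"
    if response not in {"yes", "no"}:
        return "off-format answer"
    return "wrong answer"


def run(setting: EvalSetting,
        model: Callable[[str], str],
        data: List[Tuple[str, str]]) -> Tuple[float, List[str]]:
    # Query the model through the adaptation, then score and bucket the misses.
    preds = [model(setting.adaptation(q)).strip().lower() for q, _ in data]
    golds = [g for _, g in data]
    errors = [setting.error(p) for p, g in zip(preds, golds) if p != g]
    return setting.metric(preds, golds), errors


# Toy data and a stand-in "model" that always answers "yes" (both invented).
toy_data = [
    ("Does smoking cause lung cancer?", "yes"),
    ("Does a rooster's crow cause the sunrise?", "no"),
]
setting = EvalSetting("illustrative causal target", zero_shot, accuracy, classify_error)
score, errors = run(setting, lambda prompt: "yes", toy_data)
print(score, errors)
```

Keeping the four modules as separate slots means that swapping in a different prompt style or metric changes one field and leaves the rest of the setting untouched, which is one way to read the "design space" framing in the abstract.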
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
