A Critical Review of Causal Reasoning Benchmarks for Large Language Models

Linying Yang; Vik Shirvaikar; Oscar Clivio; Fabian Falck

arXiv:2407.08029·cs.LG·July 12, 2024

TL;DR

This paper critically reviews existing benchmarks for evaluating causal reasoning in large language models, highlighting their limitations and proposing criteria for more effective assessment of causal understanding.

Contribution

It provides a comprehensive overview of current benchmarks, analyzes their effectiveness, and suggests a framework for developing better causal reasoning benchmarks for LLMs.

Findings

1. Many benchmarks can likely be solved through retrieval of domain knowledge rather than causal reasoning

2. Recent benchmarks incorporate interventional and counterfactual reasoning

3. The review derives criteria that a useful causal reasoning benchmark, or set of benchmarks, should satisfy

Abstract

Numerous benchmarks aim to evaluate the capabilities of Large Language Models (LLMs) for causal inference and reasoning. However, many of them can likely be solved through the retrieval of domain knowledge, which calls into question whether they achieve their purpose. In this review, we present a comprehensive overview of LLM benchmarks for causality. We highlight how recent benchmarks move towards a more thorough definition of causal reasoning by incorporating interventional or counterfactual reasoning. We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy. We hope this work will pave the way towards a general framework for the assessment of causal understanding in LLMs and the design of novel benchmarks.
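
To make the distinction between associational and interventional questions concrete, here is a minimal sketch on a toy structural causal model. The model, its probabilities, and all function names are hypothetical illustrations and are not taken from the paper or from any benchmark it reviews. A binary confounder Z influences both X and Y, so conditioning on X and intervening on X give different answers.

```python
from fractions import Fraction as F

# Hypothetical toy structural causal model (an assumption for illustration):
# Z -> X, Z -> Y, X -> Y, with all variables binary.
P_Z1 = F(1, 2)                           # P(Z=1)
P_X1_GIVEN_Z = {0: F(1, 5), 1: F(4, 5)}  # P(X=1 | Z=z)


def p_y1(x, z):
    """P(Y=1 | X=x, Z=z) under the assumed mechanism for Y."""
    return F(1, 10) + F(1, 5) * x + F(3, 5) * z


def p_z(z):
    """Marginal P(Z=z)."""
    return P_Z1 if z == 1 else 1 - P_Z1


def p_x_given_z(x, z):
    """P(X=x | Z=z)."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1 - p1


def observational(x):
    """P(Y=1 | X=x): observing X reweights Z via Bayes' rule."""
    joint = {z: p_z(z) * p_x_given_z(x, z) for z in (0, 1)}
    p_x = sum(joint.values())
    return sum(p_y1(x, z) * joint[z] / p_x for z in (0, 1))


def interventional(x):
    """P(Y=1 | do(X=x)): setting X leaves Z at its marginal distribution."""
    return sum(p_y1(x, z) * p_z(z) for z in (0, 1))


print("P(Y=1 | X=1)     =", observational(1))   # 39/50
print("P(Y=1 | do(X=1)) =", interventional(1))  # 3/5
```

In this toy model the observational quantity (39/50) overstates the effect of setting X (3/5), because observing X=1 also makes Z=1 more likely. A benchmark question that can be answered from the first quantity alone does not necessarily probe the second.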

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multi-Agent Systems and Negotiation

Methods: Sparse Evolutionary Training · Causal inference