CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models

Ruibo Tu; Hedvig Kjellström; Gustav Eje Henter; Cheng Zhang

arXiv:2412.17970·cs.CL·December 25, 2024

TL;DR

This paper introduces CARL-GT, a comprehensive benchmark for evaluating the causal reasoning capabilities of large language models using diverse graph and tabular data tasks, revealing current limitations in LLMs' reasoning skills.

Contribution

The paper presents a new benchmark, CARL-GT, designed to assess causal reasoning in LLMs across multiple tasks relevant to real-world problems. It also analyzes LLM performance on these tasks and the relationships among the tasks.

Findings

01 LLMs are weak in causal reasoning, especially with tabular data.

02 Performance varies across different tasks and categories.

03 Tasks in different categories show stronger correlations than those within the same category.

Abstract

Causal reasoning capabilities are essential for large language models (LLMs) in a wide range of applications, such as education and healthcare, but benchmarks for understanding these capabilities are still lacking. Current LLM benchmarks are mainly based on conversational tasks, academic math tests, and coding tests. Such benchmarks evaluate LLMs in well-regularized settings, but they are limited in assessing the skills and abilities needed to solve real-world problems. In this work, we provide a benchmark, named CARL-GT, which evaluates the CAusal Reasoning capabilities of large Language models using Graphs and Tabular data. The benchmark has a diverse range of tasks for evaluating LLMs from the causal graph reasoning, knowledge discovery, and decision-making aspects. In addition, effective zero-shot learning prompts are developed for the tasks. In our experiments, we leverage…
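To make the idea of a zero-shot causal graph reasoning query concrete, here is a minimal sketch. It is not taken from the paper or its repository: the toy graph, the prompt wording, and the helper names (`is_cause_of`, `make_prompt`, `score`) are all assumptions chosen for illustration. It shows one plausible way to turn a causal graph into a yes/no question and to check a model's answer against ground truth derived from the graph itself.

```python
from collections import deque

# Hypothetical toy causal graph (not from the paper); edges point cause -> effect.
EDGES = [("smoking", "tar"), ("tar", "cancer"), ("genetics", "cancer")]


def build_children(edges):
    """Map each node to the list of its direct effects."""
    children = {}
    for cause, effect in edges:
        children.setdefault(cause, []).append(effect)
        children.setdefault(effect, [])
    return children


def is_cause_of(edges, source, target):
    """Return True if a directed path source -> ... -> target exists (BFS)."""
    children = build_children(edges)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child == target:
                return True
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False


def make_prompt(edges, source, target):
    """Build an illustrative zero-shot prompt; the wording is an assumption."""
    edge_text = "; ".join(f"{a} -> {b}" for a, b in edges)
    return (
        f"Causal graph edges: {edge_text}.\n"
        f"Question: Is '{source}' a direct or indirect cause of '{target}'? "
        "Answer yes or no."
    )


def score(answer, edges, source, target):
    """Compare a model's yes/no answer with the graph-derived ground truth."""
    expected = "yes" if is_cause_of(edges, source, target) else "no"
    return answer.strip().lower() == expected
```

In this sketch the prompt from `make_prompt` would be sent to an LLM and the reply passed to `score`. Deriving the expected answer from the graph keeps the evaluation independent of any hand-written labels.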

Code & Models

Repositories

turuibo/cautabbench
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques