Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data

Xiao Liu; Zirui Wu; Xueqing Wu; Pan Lu; Kai-Wei Chang; Yansong Feng

arXiv:2402.17644 · cs.CL · June 11, 2024 · 1 citation

TL;DR

This paper introduces the QRData benchmark to evaluate large language models' abilities in statistical and causal reasoning with real-world data, revealing significant challenges and room for improvement.

Contribution

The paper presents QRData, the first comprehensive benchmark for assessing LLMs' quantitative reasoning with data, and evaluates natural language, program-based, and agent reasoning methods on it across diverse models.

Findings

01 GPT-4 achieves 58% accuracy on QRData

02 Open-source models reach up to 37% accuracy

03 Models struggle with causal reasoning and data analysis

Abstract

Quantitative reasoning is a critical skill to analyze data, yet the assessment of such ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, aiming to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText. We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants on diverse models. The strongest model GPT-4 achieves an accuracy of 58%, which…
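The program-based reasoning mentioned in the abstract can be illustrated with a minimal sketch: a model emits a short program, the program is run against a data sheet, and the resulting answer is scored against a gold answer. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the toy data sheet, the hard-coded "generated" program, the helper names run_program, is_correct, and accuracy, and the 3% relative tolerance for numerical answers.

```python
import csv
import io
import statistics

# Hypothetical data sheet (not from QRData).
DATA_SHEET = """group,score
a,10
a,14
b,20
b,24
"""

# Stand-in for a program a model might generate; hard-coded for illustration.
GENERATED_PROGRAM = """
a = [float(r["score"]) for r in rows if r["group"] == "a"]
b = [float(r["score"]) for r in rows if r["group"] == "b"]
answer = statistics.mean(b) - statistics.mean(a)
"""


def run_program(program, sheet_text):
    """Execute a generated program with the parsed data sheet bound to `rows`.

    exec() on untrusted code is unsafe; a real harness would sandbox this step.
    """
    rows = list(csv.DictReader(io.StringIO(sheet_text)))
    namespace = {"rows": rows, "statistics": statistics}
    exec(program, namespace)
    # The program is expected to store its result in `answer`.
    return namespace.get("answer")


def is_correct(predicted, gold, rel_tol=0.03):
    """Score one prediction; rel_tol is an assumed tolerance, not the paper's."""
    if predicted is None:
        return False
    if isinstance(gold, (int, float)):
        try:
            return abs(float(predicted) - gold) <= rel_tol * abs(gold)
        except (TypeError, ValueError):
            return False
    # Non-numeric answers (e.g. multiple choice) use case-insensitive exact match.
    return str(predicted).strip().lower() == str(gold).strip().lower()


def accuracy(results):
    """Fraction of correct predictions in a list of booleans."""
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    predicted = run_program(GENERATED_PROGRAM, DATA_SHEET)
    print(predicted, is_correct(predicted, 10))
```

In this sketch the scoring step is kept separate from program execution, so the same is_correct function could score answers produced by natural language or agent reasoning as well.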

Code & Models

Repositories

xxxiaol/qrdata (Official)

Videos

Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data · Underline

Taxonomy

Topics: Statistical and Computational Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification

Methods: Sparse Evolutionary Training · Linear Layer · Dropout · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Dense Connections · Label Smoothing · Adam · Attention Is All You Need