How well do LLMs reason over tabular data, really?
Cornelius Wolff, Madelon Hulsebos

TL;DR
This paper investigates the true reasoning capabilities of large language models over tabular data, revealing significant performance deficits and robustness issues under realistic data variations, and proposes more reliable evaluation methods.
Contribution
It critically assesses existing evaluation strategies for LLMs on tabular reasoning, introduces improved assessment methods, and highlights the impact of real-world data variations on LLM performance.
Findings
LLMs show significant deficits in tabular reasoning performance.
Current evaluation metrics may overestimate LLM capabilities.
Robustness of LLMs decreases with realistic data variations.
Abstract
Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM's realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really? We focus on two questions: 1) are the tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM's performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as of commonly used free-form text metrics such as SacreBLEU and BERTScore. We…
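To make the concern about free-form text metrics concrete, here is a minimal sketch of how a surface-overlap score can reward a wrong answer to an analytical table query. It is not the paper's evaluation method and does not call SacreBLEU or BERTScore: `token_f1` is a simplified unigram-overlap stand-in, `numeric_answer_matches` is a hypothetical strict check, and the example sentences are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word and number tokens."""
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower())


def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1, a simplified stand-in for n-gram overlap metrics."""
    ref = Counter(tokenize(reference))
    cand = Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def extract_numbers(text: str) -> list[float]:
    """Pull every number out of an answer string."""
    return [float(x) for x in re.findall(r"\d+(?:\.\d+)?", text)]


def numeric_answer_matches(reference: str, candidate: str) -> bool:
    """Hypothetical strict check: the numbers in both answers must agree."""
    return extract_numbers(reference) == extract_numbers(candidate)


# Invented example: the reference answer to an aggregation query.
reference = "The average salary in the sales department is 52300"
wrong = "The average salary in the sales department is 25300"  # digits swapped
right = "52300"  # correct but terse

for name, answer in [("wrong", wrong), ("right", right)]:
    score = token_f1(reference, answer)
    ok = numeric_answer_matches(reference, answer)
    print(f"{name}: overlap F1 = {score:.2f}, numerically correct = {ok}")
# prints:
# wrong: overlap F1 = 0.89, numerically correct = False
# right: overlap F1 = 0.20, numerically correct = True
```

In this toy case the fluent but incorrect answer scores far higher on overlap than the terse correct one, which is the kind of mismatch that motivates evaluating the answer value itself rather than its wording.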
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Library Science and Information Systems
Methods: Focus
