TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models

Ce Li; Xiaofan Liu; Zhiyan Song; Ce Chi; Chen Zhao; Jingjing Yang; Zhendong Wang; Kexin Yang; Boshen Shi; Xing Wang; Chao Deng; Junlan Feng

arXiv:2506.18421·cs.CL·July 15, 2025

TL;DR

This paper introduces TReB, a comprehensive benchmark with 26 sub-tasks for evaluating large language models' ability to reason with table-structured data, addressing a critical gap in performance assessment.

Contribution

The paper presents a new benchmark and evaluation framework for table reasoning, built on a high-quality dataset, and reports results for more than 20 LLMs that highlight the need for improved reasoning capabilities.

Findings

01. Existing LLMs show significant room for improvement in table reasoning.

02. The benchmark effectively differentiates LLM performance on various table reasoning tasks.

03. The dataset and framework are publicly available for further research.

Abstract

The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) because of its hidden semantics, inherent complexity, and structured nature. One of these challenges is the lack of an effective evaluation benchmark that fairly reflects the performance of LLMs across a broad range of table reasoning abilities. In this paper, we fill this gap by presenting TReB, a comprehensive table reasoning evaluation benchmark that measures both shallow table understanding abilities and deep table reasoning abilities across a total of 26 sub-tasks. We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework that robustly measures table reasoning capabilities under three distinct inference modes: TCoT, PoT, and ICoT. Further, we benchmark over…

Code & Models

Models

JT-LM/JT-DA-8B (Hugging Face model; 3 downloads, 2 likes)
