TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering

Junnan Zhu; Jingyi Wang; Bohan Yu; Xiaoyu Wu; Junbo Li; Lei Wang; Nan Xu

arXiv:2506.03949 · cs.CL · September 23, 2025

TL;DR

TableEval is a comprehensive benchmark designed to evaluate large language models on complex, multilingual, and multi-structured table question answering tasks, addressing real-world challenges and providing a new evaluation framework.

Contribution

The paper introduces TableEval, a realistic, multi-domain, multilingual TableQA benchmark with a novel semantic accuracy metric, SEAT, to better assess LLM performance on complex table reasoning.

Findings

01. State-of-the-art LLMs show significant gaps in complex TableQA tasks.

02. SEAT correlates highly with human judgment, improving evaluation accuracy.

03. Tables from diverse domains and languages reveal limitations of current models.
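The second finding concerns agreement between an automatic metric and human judgment. This page does not say how that agreement was measured, so the following is only a minimal sketch of one common approach, Spearman rank correlation, written in plain Python. The function names and the toy scores are hypothetical and are not taken from the paper or from SEAT itself.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: one automatic score and one human rating per answer.
metric_scores = [0.9, 0.2, 0.7, 0.4, 1.0]
human_scores = [5, 1, 4, 2, 5]
print(round(spearman(metric_scores, human_scores), 3))
```

A value near 1 would indicate that the metric orders answers much as human raters do; a rank-based statistic is used here because human ratings are often ordinal.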

Abstract

Large language models (LLMs) have shown impressive progress in natural language processing. However, they still face significant challenges in table question answering (TableQA), where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability found in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (government, finance, academia, and industry reports). In addition, TableEval features cross-lingual scenarios with…

Code & Models

Datasets

wenge-research/TableEval · dataset · 48 downloads

Videos

TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering· underline

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Text Readability and Simplification