T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

Jie Zhang; Changzai Pan; Kaiwen Wei; Sishi Xiong; Yu Zhao; Xiangyu Li; Jiaxin Peng; Xiaoyan Gu; Jian Yang; Wenhan Chang; Zhenhe Wu; Jiang Zhong; Shuangyong Song; Yongxiang Li; Xuelong Li

arXiv:2508.19813 · cs.CL · September 24, 2025

TL;DR

This paper introduces T2R-bench, a bilingual benchmark for evaluating how well large language models generate article-level reports from real-world industrial tables, and shows that current models still fall short on this task.

Contribution

The paper contributes a new benchmark built from real industrial data, together with an evaluation framework for assessing LLM performance on table-to-report generation.

Findings

01

Even state-of-the-art LLMs reach an overall score of only 62.71, leaving substantial room for improvement.

02

The benchmark comprises 457 real-world industrial tables spanning 19 industry domains and 4 table types.

03

Existing models struggle with complex, diverse industrial tables.

Abstract

Extensive research has explored the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming tabular information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks cannot adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly…

Code & Models

Datasets

Tele-AI/TeleTableBench
dataset · 203 downloads
