RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables

Nikhil Abhyankar; Purvi Chaurasia; Sanchit Kabra; Ananya Srivastava; Vivek Gupta; Chandan K. Reddy

arXiv:2511.04491 · cs.CL · November 7, 2025

TL;DR

RUST-BENCH is a benchmark of nearly 8,000 questions over real-world tables, designed to evaluate how well large language models reason over long, heterogeneous, domain-specific data. It shows that current models struggle with heterogeneous schemas and multi-hop inference.

Contribution

This work introduces RUST-BENCH, a large-scale, real-world table reasoning benchmark that challenges LLMs with heterogeneity, scale, and multi-hop inference, addressing the gap left by existing benchmarks that mostly test small, uniform tables.

Findings

01. LLMs struggle with heterogeneous schemas and multi-hop reasoning

02. Current models show weaknesses in complex, real-world table reasoning

03. RUST-BENCH provides a new challenging testbed for future research

Abstract

Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models' (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from 2031 real-world tables spanning two domains: i) RB-Science (NSF grant records) and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and…
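To make the phrase "mixing structured fields with free text and requiring multi-hop reasoning" concrete, here is a minimal Python sketch. It is not the benchmark's data format or evaluation code: the `GrantRow` type, the rows, and the `total_funding_mentioning` helper are all hypothetical, invented purely for illustration. Keyword matching stands in for the free-text understanding that a real question would likely demand.

```python
from dataclasses import dataclass


@dataclass
class GrantRow:
    award_id: str    # structured field (hypothetical column name)
    amount_usd: int  # structured numeric field (hypothetical column name)
    abstract: str    # unstructured free-text field


# Hypothetical rows, invented for illustration only (not taken from RUST-BENCH).
ROWS = [
    GrantRow("A-001", 500_000, "Studies protein folding with machine learning."),
    GrantRow("A-002", 250_000, "Field survey of coastal erosion."),
    GrantRow("A-003", 750_000, "Machine learning methods for climate models."),
]


def total_funding_mentioning(rows, phrase):
    """Answer a two-hop question over a mixed table.

    Hop 1: select rows whose free-text abstract mentions `phrase`.
    Hop 2: aggregate the structured `amount_usd` column over those rows.
    """
    needle = phrase.lower()
    matched = [r for r in rows if needle in r.abstract.lower()]
    return sum(r.amount_usd for r in matched), [r.award_id for r in matched]


total, ids = total_funding_mentioning(ROWS, "machine learning")
print(total, ids)
```

Even in this toy setting, the answer depends on both the text column and the numeric column, so an error in either hop corrupts the result. That dependency is the kind of difficulty the abstract attributes to real tables, where the text runs to thousands of tokens and the schemas vary.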

Taxonomy

Topics: Topic Modeling · Data Quality and Management · Computational and Text Analysis Methods