TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering
An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

TL;DR
TopBench is a new benchmark designed to evaluate large language models on implicit prediction tasks in tabular question answering, highlighting current models' struggles with intent recognition and reasoning.
Contribution
The paper introduces TopBench, a benchmark of 779 samples across four sub-tasks for assessing LLMs' ability to perform implicit prediction and reasoning over tables.
Findings
Current models often default to lookup instead of prediction.
Accurate intent recognition is crucial for predictive reasoning.
Enhancing prediction accuracy requires more sophisticated modeling.
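To make the lookup-versus-prediction distinction concrete, here is a minimal Python sketch. It is not taken from TopBench or its evaluation code: the toy table, the function names (`lookup`, `predict_linear`, `answer`), and the choice of a least-squares line as the predictor are all illustrative assumptions. It shows that a query about an unobserved row cannot be answered by retrieval and has to be routed to some predictive step.

```python
# Hypothetical toy table (not from TopBench): monthly sales for one product.
table = [
    {"month": 1, "sales": 100.0},
    {"month": 2, "sales": 110.0},
    {"month": 3, "sales": 120.0},
    {"month": 4, "sales": 130.0},
]


def lookup(rows, month):
    """Retrieval: return the recorded value, or None if no such row exists."""
    for row in rows:
        if row["month"] == month:
            return row["sales"]
    return None


def predict_linear(rows, month):
    """Prediction: fit an ordinary least-squares line and extrapolate.

    A straight line is an assumed stand-in for whatever predictor is used.
    """
    xs = [r["month"] for r in rows]
    ys = [r["sales"] for r in rows]
    n = len(rows)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * month


def answer(rows, month):
    """Use the recorded value when it exists; otherwise fall back to prediction."""
    observed = lookup(rows, month)
    if observed is not None:
        return "lookup", observed
    return "prediction", predict_linear(rows, month)


if __name__ == "__main__":
    print(answer(table, 2))  # month 2 is in the table, so this is a lookup
    print(answer(table, 5))  # month 5 is unobserved, so this is a prediction
```

In this sketch the routing decision is trivial because the missing row is easy to detect. The intent-recognition challenge described for the benchmark is harder: the query itself has to be read as predictive even when nothing in its wording says so.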
Abstract
Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that…
