SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han

TL;DR
SPARTA is a scalable framework for automatically generating large, high-quality multi-hop table-text question answering benchmarks that reveal weaknesses in current models' reasoning abilities.
Contribution
The paper introduces SPARTA, a novel automated framework for creating extensive, realistic multi-hop QA datasets over tables and text, with techniques ensuring question fluency and logical correctness.
Findings
State-of-the-art models drop by more than 30 F1 points on SPARTA, exposing weaknesses in their multi-hop reasoning over tables and text.
SPARTA requires only one quarter of the annotation time of HybridQA.
Generated questions cover complex reasoning operations like aggregation and multi-hop inference.
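For readers unfamiliar with the metric behind the first finding, here is a minimal sketch of a token-level F1 score of the kind commonly used in QA evaluation. This is an assumption about the general metric family, not the paper's exact scoring script; the function name `token_f1` and the whitespace tokenization are illustrative only.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer (illustrative)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 is usually reported on a 0 to 100 scale, so a drop of 30 F1 points corresponds to a drop of 0.30 in the value returned by this function.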
Abstract
Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then…
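The reference fact database described in the abstract can be pictured with a small sketch. Everything below is a hypothetical illustration: the table names (`player`, `game_fact`), the columns, and the rows are invented, and the facts are written by hand rather than extracted from passages. It only shows how a source table joined with a grounding table of atomic facts can support a multi-hop question that needs grouping and aggregation.

```python
import sqlite3


def build_reference_db() -> sqlite3.Connection:
    """Create a toy reference fact database (all names and rows are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE player (            -- stands in for a source table
            player_id INTEGER PRIMARY KEY,
            name TEXT,
            team TEXT
        );
        CREATE TABLE game_fact (         -- stands in for a grounding table:
            player_id INTEGER REFERENCES player(player_id),
            game_date TEXT,              -- each tuple is one atomic fact
            points INTEGER
        );
    """)
    conn.executemany("INSERT INTO player VALUES (?, ?, ?)", [
        (1, "Ann Lee", "Hawks"),
        (2, "Bo Kim", "Hawks"),
        (3, "Cy Roe", "Bulls"),
    ])
    conn.executemany("INSERT INTO game_fact VALUES (?, ?, ?)", [
        (1, "2024-01-05", 20),
        (1, "2024-01-09", 30),
        (2, "2024-01-05", 10),
        (3, "2024-01-05", 25),
    ])
    return conn


def total_points_by_team(conn: sqlite3.Connection) -> dict:
    """Answer 'how many points did each team's players score in total?'"""
    rows = conn.execute("""
        SELECT p.team, SUM(g.points)
        FROM player AS p
        JOIN game_fact AS g ON g.player_id = p.player_id
        GROUP BY p.team
        ORDER BY p.team
    """).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(total_points_by_team(build_reference_db()))
```

The question is answerable only by hopping from the text-derived facts to the source table and then aggregating, which is the kind of operation the abstract says existing benchmarks rarely require.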
Peer Reviews
Decision: ICLR 2026 Poster
Overall, this paper is very well written and I see great value in the proposed dataset.
1. The scope and scale of the dataset are seemingly quite novel. Great care has been taken over the aspects the dataset tests, and the overall size is sufficiently large for high-quality model comparison.
2. Each aspect of the dataset design is well motivated and easy to interpret. This paper is interpretable for the general AI research reader.
3. The dataset evaluation is sufficient and provides several in
1. A primary weakness is the lack of human-written and human-curated Q&A instances. While the synthetic generation methodology is impressive and well-evaluated, the authors could provide small curated data subsets for testing certain aspects of reasoning, for example reasoning over ranges or negations, or data inconsistencies (e.g. where text and tables have differing values for a specific fact).
2. Similarly, while the scope of SPARTA is impressive and there is variation over structural, brea
1. The paper is well written and clear. Results demonstrate the effectiveness of the dataset construction approach.
2. Automated fixing and provenance-based fixing is a pretty creative idea.
3. Substantial analysis of model failures on this dataset is also presented.
1. The constructed dataset statistics seem to be missing, e.g. how many reference tables are constructed per source dataset?
2. The scalability of this approach seems to be bottlenecked by the number of source tables.
1. The paper is clear and easy to follow.
2. The analysis of existing benchmarks is comprehensive.
3. Several experiments, along with analysis, are reported on this benchmark.
1. **Excessive Dependence on High-Capacity LLMs for Pipeline Efficiency**. The efficiency of the Provenance-based Refinement loop relies heavily on the advanced reasoning capability of a large LLM (Llama-3.1-70B-Instruct) to accurately diagnose and correct erroneous SQL predicates. However, the paper does not analyze how the framework’s quality and cost-effectiveness (central to its “Scalable” claim) would be affected when using smaller or less capable LLMs. This raises concerns about the robust
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Natural Language Processing Techniques
