Benchmarking Table Extraction from Heterogeneous Scientific Extraction Documents
Marijan Soric, Cécile Gracianne, Ioana Manolescu, Pierre Senellart

TL;DR
This paper introduces a comprehensive benchmark for table extraction from PDFs, evaluating various models on heterogeneous datasets to highlight current challenges in generalizability, robustness, and interpretability.
Contribution
It presents a new benchmark with datasets, evaluation metrics, and analysis tools for end-to-end table extraction from scientific documents.
Findings
Current TE methods lack generalizability on heterogeneous data
TE approaches face robustness and interpretability limitations
Benchmark provides a standardized evaluation framework
Abstract
Table Extraction (TE) consists in extracting tables from PDF documents, in a structured format which can be automatically processed. While numerous TE tools exist, the variety of methods and techniques makes it difficult for users to choose an appropriate one. We propose a novel benchmark for assessing end-to-end TE methods (from PDF to the final table). We contribute an analysis of TE evaluation metrics, and the design of a rigorous evaluation process, which allows scoring each TE sub-task as well as end-to-end TE, and captures model uncertainty. Along with a prior dataset, our benchmark comprises two new heterogeneous datasets of 37k samples. We run our benchmark on diverse models, including off-the-shelf libraries, software tools, large vision language models, and approaches based on computer vision. The results demonstrate that TE remains challenging: current methods suffer from a…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Web Data Mining and Analysis · Data Quality and Management
