CTE: A Dataset for Contextualized Table Extraction
Andrea Gemelli, Emanuele Vivoli, Simone Marinai

TL;DR
The paper introduces CTE, a large dataset for the task of Contextualized Table Extraction, enabling unified table structure and context understanding in scientific documents.
Contribution
It provides a new dataset with comprehensive annotations for CTE, supporting multiple table-related tasks in a unified framework.
Findings
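The dataset contains 75k fully annotated pages of scientific papers, including more than 35k tables.
The annotations support end-to-end pipelines for document layout analysis, table detection, and structure recognition.
The paper defines the CTE task and its evaluation metrics.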
Abstract
Relevant information in documents is often summarized in tables, helping the reader to identify useful facts. Most benchmark datasets support either document layout analysis or table understanding, but do not provide data for applying both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), which aims to extract and define the structure of tables considering the textual context of the document. The dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables. Data are gathered from PubMed Central, merging the information provided by annotations in the PubTables-1M and PubLayNet datasets. The dataset can support CTE and adds new classes to the original ones. The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition,…
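The merging of two annotation sources mentioned in the abstract could be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the function `merge_annotations` and the record fields (`page_id`, `label`, `bbox`, `cells`) are hypothetical names chosen for the example, not the actual schemas of PubLayNet or PubTables-1M.

```python
from collections import defaultdict


def merge_annotations(layout_regions, table_regions):
    """Group layout and table records by page, so each page carries both kinds.

    Both inputs are lists of dicts with a hypothetical "page_id" key.
    """
    pages = defaultdict(lambda: {"layout": [], "tables": []})
    for region in layout_regions:
        pages[region["page_id"]]["layout"].append(region)
    for table in table_regions:
        pages[table["page_id"]]["tables"].append(table)
    # Return a plain dict so missing pages raise KeyError instead of being created.
    return dict(pages)


# Toy records (made up for illustration, not taken from the dataset).
layout = [
    {"page_id": "doc1_p0", "label": "text", "bbox": (50, 60, 500, 120)},
    {"page_id": "doc1_p0", "label": "table", "bbox": (50, 150, 500, 400)},
    {"page_id": "doc1_p1", "label": "title", "bbox": (50, 40, 500, 60)},
]
tables = [
    {"page_id": "doc1_p0", "bbox": (50, 150, 500, 400), "cells": []},
]

merged = merge_annotations(layout, tables)
```

Keying both sources on a shared page identifier is what lets a single page record carry layout context and table structure together, which is the unified view the CTE task relies on.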
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Biomedical Text Mining and Ontologies
