CTE: A Dataset for Contextualized Table Extraction

Andrea Gemelli; Emanuele Vivoli; Simone Marinai

arXiv:2302.01451·cs.CL·February 14, 2023

TL;DR

The paper introduces CTE, a large dataset for Contextualized Table Extraction, a task that extracts tables and defines their structure while taking the textual context of scientific documents into account.

Contribution

It provides a new dataset with comprehensive annotations for CTE, supporting multiple table-related tasks in a unified framework.

Findings

1. The dataset contains 75k fully annotated pages with more than 35k tables.

2. It supports end-to-end pipelines for document layout analysis and table understanding.

3. It defines the CTE task and evaluation metrics.

Abstract

Relevant information in documents is often summarized in tables, helping the reader to identify useful facts. Most benchmark datasets support either document layout analysis or table understanding, but do not provide data for applying both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), which aims to extract and define the structure of tables considering the textual context of the document. The dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables. Data are gathered from PubMed Central, merging the information provided by annotations in the PubTables-1M and PubLayNet datasets. The dataset can support CTE and adds new classes to the original ones. The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition,…
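
The merging step mentioned in the abstract can be pictured with a minimal sketch. It is not the authors' pipeline: the function name `merge_annotations`, the field names (`page_id`, `label`, `bbox`), and the toy records are all hypothetical and do not reflect the real PubLayNet or PubTables-1M schemas. It only illustrates the general idea of grouping regions from two annotation sources under a shared page identifier.

```python
from collections import defaultdict


def merge_annotations(layout_regions, table_regions):
    """Group regions from two annotation sources by page identifier.

    Hypothetical helper: the record layout is an assumption made for
    illustration, not the format used by the CTE dataset.
    """
    pages = defaultdict(list)
    for source, regions in (("layout", layout_regions), ("table", table_regions)):
        for region in regions:
            # Tag each region with its origin so later steps can tell them apart.
            pages[region["page_id"]].append({**region, "source": source})
    return dict(pages)


# Toy records with made-up identifiers and coordinates.
layout = [
    {"page_id": "PMC0000001_p1", "label": "text", "bbox": [50, 60, 500, 200]},
    {"page_id": "PMC0000001_p1", "label": "table", "bbox": [50, 220, 500, 400]},
]
tables = [
    {"page_id": "PMC0000001_p1", "label": "table row", "bbox": [50, 220, 500, 250]},
]

merged = merge_annotations(layout, tables)
print(len(merged["PMC0000001_p1"]))
```

Keying on a page identifier is one plausible design choice here, since both source datasets are described as drawn from PubMed Central; how pages are actually matched is not stated in the text above.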

Code & Models

Repositories

ailab-unifi/cte-dataset (Official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Topic Modeling · Biomedical Text Mining and Ontologies