TabLeX: A Benchmark Dataset for Structure and Content Information Extraction from Scientific Tables

Harsh Desai; Pratik Kayal; Mayank Singh

arXiv:2105.06400 · cs.IR · September 7, 2021

TL;DR

TabLeX is a comprehensive benchmark dataset designed to advance the development of table information extraction tools from scientific articles, addressing current model limitations with diverse, annotated table images.

Contribution

This paper introduces TabLeX, a large-scale annotated dataset that pairs diverse scientific table images with their LaTeX sources, supporting robust evaluation and development of table IE models.

Findings

01

Current state-of-the-art table extraction models fail even on simple table images.

02

Transformer-based baseline shows limited performance, highlighting room for improvement.

03

Dataset will be expanded with more complex tables over time.

Abstract

Information Extraction (IE) from the tables present in scientific articles is challenging due to complicated tabular representations and complex embedded text. This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles. TabLeX consists of two subsets, one for table structure extraction and the other for table content extraction. Each table image is accompanied by its corresponding LaTeX source code. To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail even on simple table images. Finally, we experiment with an existing transformer-based baseline to report performance scores. In contrast to static benchmarks, we plan to…
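To make the distinction between the structure and content subsets concrete, the sketch below splits the LaTeX source of a simple table into structural markers and cell text. It is illustrative only: the function name `split_structure_and_content`, the `CELL` placeholder, and the chosen set of structural markers are assumptions made for this example and are not taken from the TabLeX dataset or its tokenization.

```python
import re

# Hypothetical set of structural markers for a plain tabular environment.
# The dataset's actual vocabulary may differ.
STRUCTURE_PATTERN = re.compile(
    r"\\begin\{tabular\}\{[^}]*\}|\\end\{tabular\}|\\hline|\\\\|&"
)


def split_structure_and_content(latex_table: str):
    """Return (structure_tokens, cell_contents) for a simple tabular source.

    Each non-empty run of text between structural markers is treated as one
    cell: its text goes to cell_contents and a "CELL" placeholder goes to
    structure_tokens.
    """
    structure_tokens = []
    cell_contents = []
    position = 0
    for match in STRUCTURE_PATTERN.finditer(latex_table):
        cell = latex_table[position:match.start()].strip()
        if cell:
            cell_contents.append(cell)
            structure_tokens.append("CELL")
        structure_tokens.append(match.group())
        position = match.end()
    # Keep any text that follows the last structural marker.
    trailing = latex_table[position:].strip()
    if trailing:
        cell_contents.append(trailing)
        structure_tokens.append("CELL")
    return structure_tokens, cell_contents


# Made-up example table, not drawn from the dataset.
example = r"""\begin{tabular}{lc}
\hline
Model & Accuracy \\
\hline
Baseline & 0.5 \\
\hline
\end{tabular}"""

structure_tokens, cell_contents = split_structure_and_content(example)
```

Under these assumptions, a structure extraction model would target something like `structure_tokens` (layout only), while a content extraction model would target the text collected in `cell_contents`.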
