TableBank: A Benchmark Dataset for Table Detection and Recognition

Minghao Li; Lei Cui; Shaohan Huang; Furu Wei; Ming Zhou; Zhoujun Li

arXiv:1903.01949 · cs.CV · July 7, 2020 · 30 cites

TL;DR

TableBank is a large-scale dataset with 417,000 labeled tables, created using weak supervision from internet documents, to improve deep learning models for table detection and recognition.

Contribution

The paper introduces TableBank, a large, publicly available dataset for table detection and recognition, built with weak supervision from Word and LaTeX documents.

Findings

01. Strong baselines achieved using state-of-the-art models.
02. The dataset enables better generalization in real-world applications.
03. Public availability of dataset and models facilitates further research.

Abstract

We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research on image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, an approach that generalizes poorly to real-world applications. With TableBank, which contains 417K high-quality labeled tables, we build several strong baselines using state-of-the-art deep neural network models. We make TableBank publicly available and hope it will empower more deep learning approaches to the table detection and recognition task. The dataset and models are available at https://github.com/doc-analysis/TableBank.

Taxonomy

Topics: Handwritten Text Recognition Techniques · Text and Document Classification Technologies · Multimodal Machine Learning Applications