TableBank: A Benchmark Dataset for Table Detection and Recognition
Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, Zhoujun Li

TL;DR
TableBank is a large-scale dataset with 417,000 labeled tables, created using weak supervision from internet documents, to improve deep learning models for table detection and recognition.
Contribution
The paper introduces TableBank, a publicly available image-based dataset for table detection and recognition, built with weak supervision from Word and LaTeX documents on the internet.
Findings
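Several strong baselines were built on TableBank using state-of-the-art deep neural network models.
With 417K labeled tables, the dataset is meant to address the poor real-world generalization of pre-trained models fine-tuned on only a few thousand human-labeled examples.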
Public availability of dataset and models facilitates further research.
Abstract
We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet. Existing research for image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which is difficult to generalize on real-world applications. With TableBank that contains 417K high quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks. We make TableBank publicly available and hope it will empower more deep learning approaches in the table detection and recognition task. The dataset and models are available at https://github.com/doc-analysis/TableBank.
Taxonomy
Topics: Handwritten Text Recognition Techniques · Text and Document Classification Technologies · Multimodal Machine Learning Applications
