TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets

Susie Xi Rao; Johannes Rausch; Peter Egger; Ce Zhang

arXiv:2201.01654·cs.CV·January 6, 2022


TL;DR

TableParser is a system that parses table structures and extracts their content from native PDFs and scanned images, trained with weak supervision derived from spreadsheets.

Contribution

The paper introduces TableParser, a system that parses tables in both native PDFs and scanned images with high precision, together with TableAnnotator and ExcelAnnotator, which constitute a spreadsheet-based weak supervision mechanism and a pipeline to enable table parsing.

Findings

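01 TableParser parses tables in both native PDFs and scanned images with high precision.

02 Extensive experiments show the efficacy of domain adaptation in developing such a tool.

03 The resources are shared with the research community to facilitate further research.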

Abstract

Tables have long been a structure for storing data. There are now different ways to store tabular data physically; PDFs, images, spreadsheets, and CSVs are leading examples. Being able to parse table structures and extract the content bounded by these structures is of high importance in many applications. In this paper, we devise TableParser, a system capable of parsing tables in both native PDFs and scanned images with high precision. We have conducted extensive experiments to show the efficacy of domain adaptation in developing such a tool. Moreover, we create TableAnnotator and ExcelAnnotator, which constitute a spreadsheet-based weak supervision mechanism and a pipeline to enable table parsing. We share these resources with the research community to facilitate further research in this direction.
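The idea of spreadsheet-based weak supervision can be illustrated with a minimal sketch. It is not the authors' ExcelAnnotator; it only assumes that a spreadsheet exposes column widths, row heights, and cell values, from which one bounding box per cell can be derived without manual labeling. The names `CellBox` and `cell_boxes` and the unit-free coordinates are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class CellBox:
    """A weak label: grid position, bounding box, and cell text (hypothetical type)."""
    row: int
    col: int
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def cell_boxes(col_widths, row_heights, values):
    """Derive one bounding box per cell from grid geometry (illustrative only)."""
    # Cumulative sums give the left edge of each column and the top edge of each row.
    xs = [0.0, *accumulate(col_widths)]
    ys = [0.0, *accumulate(row_heights)]
    boxes = []
    for r, row in enumerate(values):
        for c, text in enumerate(row):
            boxes.append(CellBox(r, c, xs[c], ys[r], xs[c + 1], ys[r + 1], str(text)))
    return boxes


# Assumed toy sheet: two columns, two rows.
boxes = cell_boxes(
    col_widths=[40.0, 60.0],
    row_heights=[10.0, 10.0],
    values=[["Year", "Revenue"], [2020, 1.5]],
)
for box in boxes:
    print(box)
```

In a sketch like this, the labels come for free from the spreadsheet layout, which is why such supervision is called weak: the boxes are only as accurate as the mapping between the spreadsheet grid and the rendered page.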

Code & Models

Repositories

ds3lab/tableparser
PyTorch · Official

Taxonomy

Topics: Data Quality and Management · Web Data Mining and Analysis · Handwritten Text Recognition Techniques