TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets
Susie Xi Rao, Johannes Rausch, Peter Egger, Ce Zhang

TL;DR
TableParser is a system that extracts table structures and their content from native PDFs and scanned images with high precision, trained using weak supervision derived from spreadsheets.
Contribution
The paper introduces TableParser, a system that parses tables in native PDFs and scanned images with high precision, together with TableAnnotator and ExcelAnnotator, two tools that provide a spreadsheet-based weak supervision mechanism and a pipeline for table parsing.
Findings
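High-precision table parsing on both native PDFs and scanned images
Extensive experiments showing that domain adaptation is effective for building such a tool
TableAnnotator and ExcelAnnotator shared with the research community to support further work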
Abstract
Tables have been an ever-existing structure to store data. There exist now different approaches to store tabular data physically. PDFs, images, spreadsheets, and CSVs are leading examples. Being able to parse table structures and extract content bounded by these structures is of high importance in many applications. In this paper, we devise TableParser, a system capable of parsing tables in both native PDFs and scanned images with high precision. We have conducted extensive experiments to show the efficacy of domain adaptation in developing such a tool. Moreover, we create TableAnnotator and ExcelAnnotator, which constitute a spreadsheet-based weak supervision mechanism and a pipeline to enable table parsing. We share these resources with the research community to facilitate further research in this interesting direction.
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Handwritten Text Recognition Techniques
