PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction
Lei Sheng, Shuai-Shuai Xu

TL;DR
PdfTable is a comprehensive toolkit that integrates multiple models and tools to improve deep learning-based table extraction from diverse document formats, including PDFs and images, across various scenarios.
Contribution
It introduces a unified, adaptable toolkit combining recognition, OCR, and layout analysis models to handle diverse table extraction challenges in unstructured documents.
Findings
Effective on wired and wireless table datasets
Handles both digital and image-based PDFs
Open-source implementation available on GitHub
Abstract
Currently, a substantial volume of document data exists in an unstructured format, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, pdfplumber, and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can only extract tables from digital PDFs and cannot handle image-based PDFs or pictures. PP-StructureV2, on the other hand, can extract tables from both image-based PDFs and pictures. Nevertheless, it lacks the ability to differentiate between diverse application scenarios, such as wired tables and wireless tables,…
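The two distinctions the abstract draws (digital versus image-based input, wired versus wireless tables) can be read as a routing decision. The following is a minimal sketch of that idea only, not PdfTable's actual API: `SourceKind`, `TableStyle`, `Plan`, `plan_extraction`, and the model-name strings are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SourceKind(Enum):
    DIGITAL_PDF = "digital_pdf"  # page carries an embedded text layer
    IMAGE = "image"              # scanned PDF page or standalone picture


class TableStyle(Enum):
    WIRED = "wired"        # cells are delimited by ruling lines
    WIRELESS = "wireless"  # cells are implied by spacing only


@dataclass(frozen=True)
class Plan:
    needs_ocr: bool       # whether text must be recognized from pixels
    text_source: str      # where cell text comes from
    structure_model: str  # hypothetical label for the structure recognizer


def plan_extraction(kind: SourceKind, style: TableStyle) -> Plan:
    """Pick a processing plan from the input kind and the table style."""
    # Image-based input has no text layer, so OCR is required.
    needs_ocr = kind is SourceKind.IMAGE
    return Plan(
        needs_ocr=needs_ocr,
        text_source="ocr" if needs_ocr else "pdf_text_layer",
        # Assumption: a separate recognizer is selected per table style.
        structure_model=f"{style.value}_table_recognizer",
    )
```

For example, `plan_extraction(SourceKind.IMAGE, TableStyle.WIRELESS)` yields a plan that requires OCR and selects the hypothetical wireless-table recognizer, whereas a digital PDF reuses the embedded text layer.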
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction
