PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction

Lei Sheng; Shuai-Shuai Xu

arXiv:2409.05125 · cs.CV · September 10, 2024

TL;DR

PdfTable is a comprehensive toolkit that integrates multiple models and tools to improve deep learning-based table extraction from diverse document formats, including PDFs and images, across various scenarios.

Contribution

It introduces a unified, adaptable toolkit combining recognition, OCR, and layout analysis models to handle diverse table extraction challenges in unstructured documents.

Findings

01. Effective on wired and wireless table datasets
02. Handles both digital and image-based PDFs
03. Open-source implementation available on GitHub

Abstract

Currently, a substantial volume of document data exists in an unstructured format, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, PDFPlumber (pdfplumber), and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can extract tables only from digital PDFs and cannot handle image-based PDFs or pictures. PP-StructureV2, on the other hand, can extract tables from both image-based PDFs and pictures. Nevertheless, it lacks the ability to differentiate between diverse application scenarios, such as wired tables and wireless tables,…
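
The distinction the abstract draws between digital PDFs (which carry an embedded text layer) and image-based PDFs (which need OCR) suggests a routing step in front of any extractor. The following is a minimal sketch of such a step, not PdfTable's actual logic: the names `PageInfo`, `Route`, and `choose_route`, and the `min_chars` threshold, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """Hypothetical extraction paths for a single page."""
    DIGITAL = "digital"  # embedded text layer: parse characters directly
    IMAGE = "image"      # scanned page or picture: OCR is required first


@dataclass
class PageInfo:
    """Hypothetical per-page summary produced by some upstream PDF parser."""
    char_count: int   # characters found in the embedded text layer
    image_count: int  # raster images placed on the page


def choose_route(page: PageInfo, min_chars: int = 20) -> Route:
    """Pick an extraction path from the amount of embedded text.

    Assumption: a page with fewer than `min_chars` embedded characters
    is treated as image-based. The threshold is illustrative only.
    """
    if page.char_count >= min_chars:
        return Route.DIGITAL
    return Route.IMAGE


if __name__ == "__main__":
    pages = [
        PageInfo(char_count=350, image_count=0),  # born-digital page
        PageInfo(char_count=0, image_count=1),    # scanned page
    ]
    for page in pages:
        print(page, choose_route(page).value)
```

In practice the character count would come from whichever PDF parser is in use, and a real pipeline would likely add a second decision (for example, wired versus wireless tables) after this one.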

Code & Models

Repositories

cycloneboy/pdf_table (PyTorch, official)

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction