A Comparative Study of PDF Parsing Tools Across Diverse Document Categories
Narayan S. Adhikari, Shradha Agarwal

TL;DR
This study compares 10 PDF parsing tools across 6 document categories, evaluating text extraction and table detection and showing how each tool's strengths and weaknesses depend on document type and task.
Contribution
It provides a comprehensive comparison of PDF parsing tools across diverse document types, highlighting their performance variations and guiding tool selection for specific applications.
Findings
PyMuPDF and pypdfium2 excel in text extraction overall.
Learning-based tools like Nougat perform better on scientific and patent documents.
TATR outperforms the other tools in table detection for scientific and financial documents.
Abstract
PDF is one of the most prominent data formats, making PDF parsing crucial for information extraction and retrieval, particularly with the rise of RAG systems. While various PDF parsing tools exist, their effectiveness across different document types remains understudied, especially beyond academic papers. Our research aims to address this gap by comparing 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset. These tools include PyPDF, pdfminer-six, PyMuPDF, pdfplumber, pypdfium2, Unstructured, Tabula, Camelot, as well as the deep learning-based tools Nougat and Table Transformer (TATR). We evaluated both text extraction and table detection capabilities. For text extraction, PyMuPDF and pypdfium2 generally outperformed others, but all parsers struggled with Scientific and Patent documents. For these challenging categories, learning-based tools like Nougat…
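The text-extraction comparison described in the abstract can be illustrated with a small scoring sketch. The summary here does not state which metrics the paper uses, so the token-level F1 below is an assumption chosen for illustration, and the function names `token_f1` and `rank_parsers` are hypothetical. The sketch assumes each parser's output and the ground-truth text are already available as plain strings.

```python
from collections import Counter


def token_f1(extracted, reference):
    """Token-level F1 between a parser's output and a reference text.

    Assumption: whitespace tokenization and bag-of-words overlap. This is
    an illustrative metric, not necessarily the one used in the paper.
    """
    ext_tokens = extracted.split()
    ref_tokens = reference.split()
    if not ext_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(ext_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ext_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def rank_parsers(outputs, reference):
    """Sort parsers (name -> extracted text) by F1, best first."""
    scores = {name: token_f1(text, reference) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In use, `outputs` would map each tool name to the text it extracted from the same page, and `reference` would be the ground-truth text for that page; averaging the scores per document category would then give a per-category comparison of the kind the abstract describes.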
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
Methods: Linear Layer · Multi-Head Attention · Dense Connections · WordPiece · Residual Connection · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Adam
