AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Carlo Siebenschuh, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Arham Khan, Khalid Hossain, Yadu Babuji, Nicholas Chia, Venkatram Vishwanath, Rick Stevens, Arvind Ramanathan, Ian Foster, Robert Underwood

TL;DR
AdaParse is an adaptive, resource-efficient PDF parsing engine that intelligently assigns parsers to scientific documents, significantly boosting throughput while maintaining high accuracy for large-scale scientific text processing.
Contribution
AdaParse introduces a data-driven, adaptive approach that selects a parser for each document based on document complexity and human preferences, improving scalability and accuracy in scientific PDF parsing.
Findings
Achieves 17x throughput improvement over state-of-the-art parsers.
Maintains or slightly improves accuracy (by 0.2%) on a benchmark of 1000 documents.
Enables large-scale scientific document corpus processing for high-quality dataset creation.
Abstract
Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each…
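The abstract is truncated here, so the actual selection rule is not shown. As a rough illustration of the general idea of per-document parser assignment under a compute budget, the Python sketch below routes a limited fraction of documents to a costly parser, choosing those with the largest predicted accuracy gain. Everything in it is an assumption made for illustration and not the authors' method: the `Doc` record, the `assign_parsers` function, the `costly_fraction` parameter, and the accuracy scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    name: str
    cheap_acc: float   # hypothetical predicted accuracy of an inexpensive heuristic parser
    costly_acc: float  # hypothetical predicted accuracy of an ML-driven parser


def assign_parsers(docs, costly_fraction):
    """Send at most `costly_fraction` of the documents to the costly parser.

    Documents are ranked by predicted accuracy gain (costly minus cheap);
    a document is only upgraded if that gain is positive.
    """
    budget = int(len(docs) * costly_fraction)
    ranked = sorted(docs, key=lambda d: d.costly_acc - d.cheap_acc, reverse=True)
    chosen = {d.name for d in ranked[:budget] if d.costly_acc > d.cheap_acc}
    return {d.name: ("costly" if d.name in chosen else "cheap") for d in docs}


# Made-up scores, purely for illustration.
docs = [
    Doc("a.pdf", 0.95, 0.96),
    Doc("b.pdf", 0.40, 0.90),
    Doc("c.pdf", 0.80, 0.78),
    Doc("d.pdf", 0.60, 0.85),
]
assignment = assign_parsers(docs, costly_fraction=0.5)
for name, parser in assignment.items():
    print(name, parser)
```

With a budget of half the documents, the two with the largest predicted gain (`b.pdf` and `d.pdf`) go to the costly parser and the rest stay on the cheap one. A real system would need a learned accuracy predictor and a cost model tied to the available hardware, neither of which is modeled in this sketch.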
Taxonomy
Topics: Distributed and Parallel Computing Systems · Advanced Database Systems and Queries · Advanced Data Storage Technologies
Methods: Sparse Evolutionary Training
