Accelerating End-to-End PDF to Markdown Conversion Through Assisted Generation
Changxu Duan

TL;DR
This paper introduces Copy Lookup Decoding (CLD), a method that accelerates end-to-end PDF to Markdown conversion by drawing candidate token sequences from the PDF itself, exploiting the high n-gram overlap between a PDF and its Markdown equivalent.
Contribution
The paper proposes CLD, a decoding technique that modifies Prompt Lookup Decoding (PLD) to extract candidate sequences directly from PDF files, so that dense text can be copied rather than decoded token by token from scratch.
Findings
CLD accelerates PDF to Markdown conversion by up to 1.70×.
The method maintains original quality while improving efficiency.
Open-source implementation available on GitHub.
Abstract
Converting data from machine-unreadable formats like PDFs into Markdown has the potential to enhance the accessibility of scientific research. Existing end-to-end decoder transformer models can transform screenshots of PDFs into Markdown, offering more flexibility than pipeline-based methods. Yet, decoding text token by token from scratch is inefficient, especially when dense text can be directly copied from the PDF. To address this challenge, this paper modifies Prompt Lookup Decoding (PLD) to extract candidate sequences directly from PDF files, leveraging the high n-gram overlap between PDFs and their Markdown equivalents. A new method, Copy Lookup Decoding (CLD), is introduced here to enhance PLD's candidate generation mechanism. Experiments demonstrate that CLD can accelerate the conversion process by up to 1.70× at original quality. The codebase for this paper is open-source…
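The lookup idea described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names `lookup_candidates` and `accept_prefix`, the parameters `ngram_size` and `max_candidate_len`, and the `next_token` callable (a stand-in for the decoder model) are all hypothetical, and the summary above does not spell out CLD's specific changes to candidate generation. The sketch shows only the general prompt-lookup pattern: find the most recently generated n-gram in the text extracted from the PDF, propose the tokens that follow it, and keep the prefix the model agrees with.

```python
from typing import Callable, List, Sequence


def lookup_candidates(
    generated: Sequence[str],
    source: Sequence[str],
    ngram_size: int = 3,
    max_candidate_len: int = 10,
) -> List[str]:
    """Propose tokens by matching the last n-gram of `generated` in `source`.

    `source` stands for tokens extracted from the PDF (an assumption of this
    sketch). Returns the tokens following the first match, or an empty list.
    """
    if len(generated) < ngram_size:
        return []
    tail = list(generated[-ngram_size:])
    for start in range(len(source) - ngram_size + 1):
        if list(source[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            candidates = list(source[follow:follow + max_candidate_len])
            if candidates:
                return candidates
    return []


def accept_prefix(
    generated: Sequence[str],
    candidates: Sequence[str],
    next_token: Callable[[List[str]], str],
) -> List[str]:
    """Keep candidates while they agree with the model's greedy choice.

    `next_token` is a hypothetical stand-in for the decoder. A real system
    would score all candidates in one forward pass; this loop calls the
    stand-in once per token purely for clarity.
    """
    accepted: List[str] = []
    for token in candidates:
        predicted = next_token(list(generated) + accepted)
        if predicted != token:
            # First disagreement: keep the model's own token and stop.
            accepted.append(predicted)
            return accepted
        accepted.append(token)
    return accepted


if __name__ == "__main__":
    pdf_tokens = "the cat sat on the mat".split()
    target = "the cat sat on a mat".split()  # what the toy model would emit
    so_far = ["the", "cat", "sat"]

    def toy_model(prefix: List[str]) -> str:
        return target[len(prefix)]

    proposed = lookup_candidates(so_far, pdf_tokens, ngram_size=2, max_candidate_len=3)
    print(proposed)
    print(accept_prefix(so_far, proposed, toy_model))
```

Because accepted tokens always match what the model would have produced greedily, a scheme of this kind leaves the output unchanged and only reduces the number of decoding steps, which is consistent with the abstract's claim of a speedup "at original quality".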
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Digital Humanities and Scholarship · Scientific Computing and Data Management
