Accelerating End-to-End PDF to Markdown Conversion Through Assisted Generation

Changxu Duan

arXiv:2512.18122·cs.MM·December 23, 2025

Open Access

TL;DR

This paper introduces Copy Lookup Decoding (CLD), a method that speeds up end-to-end PDF to Markdown conversion by up to 1.70× at original quality. It drafts candidate token sequences directly from the PDF, exploiting the high n-gram overlap between a PDF and its Markdown equivalent.

Contribution

The paper proposes CLD, a decoding technique that modifies Prompt Lookup Decoding (PLD) so that candidate sequences are extracted directly from the PDF file rather than decoded token by token from scratch.

Findings

01

CLD accelerates PDF to Markdown conversion by up to 1.70×.

02

The method maintains original quality while improving efficiency.

03

Open-source implementation available on GitHub.

Abstract

Converting data from machine-unreadable formats like PDFs into Markdown has the potential to enhance the accessibility of scientific research. Existing end-to-end decoder transformer models can transform screenshots of PDFs into Markdown, offering more flexibility than pipeline-based methods. Yet, decoding text token by token from scratch is inefficient, especially when dense text can be directly copied from the PDF. To address this challenge, this paper modifies Prompt Lookup Decoding (PLD) to extract candidate sequences directly from PDF files, leveraging the high n-gram overlap between PDFs and their Markdown equivalents. A new method, Copy Lookup Decoding (CLD), is introduced here to enhance PLD's candidate generation mechanism. Experiments demonstrate that CLD can accelerate the conversion process by up to $1.70\times$ at original quality. The codebase for this paper is open-source…
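The lookup idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's CLD implementation: it shows only the general n-gram lookup pattern behind prompt-lookup-style drafting, with the lookup source assumed to be tokens extracted from the PDF. The function names (`propose_candidate`, `accept_prefix`), the parameters (`max_ngram`, `num_draft`), and the use of plain integer token IDs are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def propose_candidate(
    generated: Sequence[int],
    source: Sequence[int],
    max_ngram: int = 3,
    num_draft: int = 5,
) -> List[int]:
    """Draft tokens by matching the tail of `generated` inside `source`.

    `source` stands in for tokens extracted from the PDF (an assumption).
    The longest matching n-gram is tried first; the tokens that follow the
    first match in `source` are returned as the draft.
    """
    for n in range(min(max_ngram, len(generated)), 0, -1):
        tail = list(generated[-n:])
        for start in range(len(source) - n + 1):
            if list(source[start:start + n]) == tail:
                follow = list(source[start + n:start + n + num_draft])
                if follow:
                    return follow
    return []  # no match: fall back to ordinary token-by-token decoding


def accept_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep the longest prefix of `draft` that agrees with the model's own tokens.

    `verified` stands in for the tokens the model would produce when it
    checks the draft in a single forward pass (hypothetical interface).
    """
    accepted: List[int] = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted


# Toy example with made-up token IDs.
pdf_tokens = [10, 11, 12, 13, 14, 15]
output_so_far = [1, 11, 12]
draft = propose_candidate(output_so_far, pdf_tokens)
kept = accept_prefix(draft, [13, 14, 99])
```

In this toy run the two-token tail `[11, 12]` is found in `pdf_tokens`, so the following tokens are drafted, and only the part of the draft that the model agrees with is kept. Because rejected draft tokens are discarded, output quality is governed by the model rather than by the lookup, which is consistent with the abstract's claim of a speedup at original quality.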

Taxonomy

Topics

Mathematics, Computing, and Information Processing · Digital Humanities and Scholarship · Scientific Computing and Data Management