Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown

Changxu Duan

arXiv:2512.18115 · cs.MM · December 23, 2025

TL;DR

This paper introduces EditTrans, a hybrid editing-generation model that transforms academic PDFs into markup language using layout-aware editing, reducing transformation latency by up to 44.5% while maintaining quality.

Contribution

The paper presents EditTrans, a layout-aware editing-generation model that makes PDF-to-markup transformation more efficient through a lightweight classification approach.
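
The editing-generation idea can be illustrated with a minimal sketch. It assumes, hypothetically, that a classifier assigns each text span extracted from the PDF one of three labels (keep, delete, generate), and that the decoder is invoked only for spans labeled generate. The label set, the rule-based `toy_classifier`, and the `toy_decoder` are placeholders invented for this illustration; they are not the classifier, labels, or decoder used by EditTrans.

```python
from typing import Callable, List, Tuple

# Hypothetical edit labels; the paper's actual label set is not given here.
KEEP, DELETE, GENERATE = "keep", "delete", "generate"


def toy_classifier(text: str) -> str:
    """Rule-based stand-in for a learned, layout-aware classifier."""
    if text.strip().isdigit():  # e.g. a bare page number
        return DELETE
    if "=" in text:  # formula-like span that needs markup
        return GENERATE
    return KEEP  # dense prose: copy it straight from the PDF text


def toy_decoder(text: str) -> str:
    """Stand-in for an expensive token-by-token decoder."""
    return f"$${text}$$"


def transform(
    spans: List[str],
    classify: Callable[[str], str] = toy_classifier,
    generate: Callable[[str], str] = toy_decoder,
) -> Tuple[str, int]:
    """Copy KEEP spans, drop DELETE spans, decode only GENERATE spans.

    Returns the assembled markup and the number of decoder calls made.
    """
    out: List[str] = []
    decoder_calls = 0
    for text in spans:
        label = classify(text)
        if label == KEEP:
            out.append(text)
        elif label == GENERATE:
            out.append(generate(text))
            decoder_calls += 1
        elif label != DELETE:
            raise ValueError(f"unknown label: {label!r}")
    return "\n\n".join(out), decoder_calls


if __name__ == "__main__":
    markup, calls = transform(["Dense paragraph text.", "E = mc^2", "12"])
    print(markup)
    print("decoder calls:", calls)
```

In this toy run only one of the three spans reaches the decoder; the prose span is copied as-is and the page number is dropped, which is the kind of saving the editing approach aims for on text-heavy pages.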

Findings

01 Reduced transformation latency by up to 44.5%

02 Maintained high transformation quality

03 Leveraged a fine-tuned Document Layout Analysis model

Abstract

Academic documents stored in PDF format can be transformed into plain-text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible for diverse uses, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely laid-out text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models are markedly inefficient: their token-by-token decoding from scratch wastes many inference steps regenerating dense text that could be copied directly from the PDF files. To solve this problem, we introduce EditTrans, a hybrid…

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Digital Humanities and Scholarship · Handwritten Text Recognition Techniques