Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown
Changxu Duan

TL;DR
This paper introduces EditTrans, a hybrid model that transforms academic PDFs into markup language through layout-aware editing, reducing transformation latency by up to 44.5% while maintaining transformation quality.
Contribution
The paper presents EditTrans, a novel layout-aware editing-generation model that makes PDF-to-markup transformation more efficient through a lightweight classification approach.
Findings
Reduced transformation latency by up to 44.5%
Maintained high transformation quality
Leveraged a fine-tuned Document Layout Analysis model
Abstract
Academic documents stored in PDF format can be transformed into plain-text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable to diverse uses, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely laid-out text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models are significantly inefficient: their token-by-token decoding from scratch wastes many inference steps regenerating dense text that could be copied directly from the PDF files. To solve this problem, we introduce EditTrans, a hybrid…
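The inefficiency described in the abstract can be illustrated with a minimal sketch. It is not the EditTrans implementation, and the abstract as shown here is truncated before the method is described. The sketch assumes, hypothetically, that a lightweight classifier has already assigned one of three labels (`keep`, `delete`, `insert`) to each token extracted from the PDF, and that a separate generator is called only at `insert` positions. The label names, the `Token` type, and the `edit_transform` function are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical edit labels; the paper's actual label set is not given in this summary.
KEEP, DELETE, INSERT = "keep", "delete", "insert"


@dataclass
class Token:
    text: str    # token text extracted directly from the PDF
    action: str  # label a lightweight classifier might assign (assumed)


def edit_transform(
    tokens: List[Token],
    generate: Callable[[List[str]], str],
) -> Tuple[str, int]:
    """Copy kept tokens and call the generator only at insert points.

    Returns the output text and the number of generator calls, which
    stands in for the expensive decoding steps.
    """
    out: List[str] = []
    calls = 0
    for tok in tokens:
        if tok.action == KEEP:
            out.append(tok.text)        # copied, no decoding step
        elif tok.action == INSERT:
            out.append(generate(out))   # generate markup before this token
            calls += 1
            out.append(tok.text)
        # DELETE: drop the token (for example, a page number)
    return " ".join(out), calls


def stub_generator(context: List[str]) -> str:
    # Stand-in for a real decoder; always emits a heading marker.
    return "#"


page = [
    Token("Introduction", INSERT),
    Token("Academic", KEEP),
    Token("PDFs", KEEP),
    Token("3", DELETE),
]
text, calls = edit_transform(page, stub_generator)
print(text, calls)
```

In this toy input the generator runs once for four source tokens, whereas decoding from scratch would need a step for every output token. The real saving depends on how much of a page is dense text that can be copied.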
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Digital Humanities and Scholarship · Handwritten Text Recognition Techniques
