LLM-supported document separation for printed reviews from zbMATH Open

Ivan Pluzhnikov; Ankit Satpute; Moritz Schubotz; Olaf Teschke; Bela Gipp

arXiv:2604.00554 · cs.DL · April 2, 2026

TL;DR

This paper develops an optimized LLM-based pipeline for digitizing and segmenting mathematical documents from zbMATH Open, significantly improving machine processing and accessibility of scanned literature.

Contribution

The paper introduces an LLM fine-tuning and voting framework for document separation that outperforms traditional methods, and uses it to convert 810,977 documents into machine-readable format.

Findings

01 Mathpix identified as the best OCR tool for LaTeX conversion.

02 Achieved 97.5% accuracy in document text extraction.

03 Processed 810,977 documents into machine-readable text.

Abstract

This paper presents a specialized methodology for digitizing and segmenting mathematical documents from zbMATH Open, a comprehensive database of mathematical literature, to enhance machine processing capabilities. Currently, approximately 831,000 documents exist only in scanned volumes, so they are not machine-processable. Furthermore, these scans often span multiple pages or share pages with other documents and incorporate diverse typesetting techniques, posing challenges for automated processing. To address these issues, we evaluate various Optical Character Recognition (OCR) tools and document separation techniques, proposing an optimized pipeline that outperforms existing approaches. Our study identifies Mathpix as the most effective OCR tool for LaTeX conversion, demonstrating superior performance based on BLEU and Edit Distance metrics. For document separation, we fine-tune…
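The abstract names Edit Distance as one of the metrics used to compare OCR tools. As a rough illustration only, the sketch below computes character-level Levenshtein distance between a ground-truth LaTeX string and an OCR output. The function names and the normalization by the longer string's length are assumptions made here for illustration; the abstract does not say how the authors tokenize or normalize the metric.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    # Only the previous row of the dynamic-programming table is kept.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete ref_char
                    current[j - 1] + 1,                   # insert hyp_char
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization),
    giving 0.0 for identical strings and 1.0 for entirely different ones."""
    longest = max(len(reference), len(hypothesis))
    if longest == 0:
        return 0.0
    return edit_distance(reference, hypothesis) / longest


if __name__ == "__main__":
    ground_truth = r"\frac{a}{b}"
    ocr_output = r"\frac{a}{c}"  # hypothetical OCR result with one wrong character
    print(edit_distance(ground_truth, ocr_output))
    print(normalized_edit_distance(ground_truth, ocr_output))
```

Under this convention a lower score means the OCR output is closer to the reference, so tools could be ranked by their average score over a labelled sample.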
