LLM-supported document separation for printed reviews from zbMATH Open
Ivan Pluzhnikov, Ankit Satpute, Moritz Schubotz, Olaf Teschke, Bela Gipp

TL;DR
This paper develops an optimized LLM-based pipeline for digitizing and segmenting mathematical documents from zbMATH Open, significantly improving machine processing and accessibility of scanned literature.
Contribution
It introduces a novel LLM fine-tuning and voting framework for document separation that outperforms traditional methods, and it converts over 810,000 documents into a machine-readable format.
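To make the idea of a voting framework concrete, the following is a minimal sketch of plain majority voting over per-line boundary labels. It is not the authors' implementation: the label names `START` and `CONT`, the per-line labelling scheme, and the tie-breaking rule are all assumptions introduced here for illustration, since this summary does not describe how the paper's models vote.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label most models agree on.

    Ties are broken in favour of the label that appears first in
    `predictions` (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def vote_boundaries(per_model_labels):
    """Combine several models' label sequences position by position.

    `per_model_labels` is a list of equal-length sequences, one per model.
    Each entry is a hypothetical label: "START" if the line is predicted to
    begin a new document, "CONT" if it continues the previous one.
    """
    return [majority_vote(list(column)) for column in zip(*per_model_labels)]


if __name__ == "__main__":
    model_outputs = [
        ["START", "CONT", "CONT"],
        ["START", "START", "CONT"],
        ["START", "CONT", "START"],
    ]
    print(vote_boundaries(model_outputs))
```

With an odd number of voters and two labels, ties cannot occur, which is one reason ensembles of this kind often use three or five models.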
Findings
Mathpix was identified as the most effective OCR tool for LaTeX conversion, based on BLEU and Edit Distance metrics.
Achieved 97.5% accuracy in document text extraction.
Processed 810,977 documents into machine-readable text.
Abstract
This paper presents a specialized methodology for digitizing and segmenting mathematical documents from zbMATH Open, a comprehensive database of mathematical literature, to enhance machine processing capabilities. Currently, approximately 831,000 documents exist only in scanned volumes and are therefore not machine-processable. Furthermore, these scans often span multiple pages or share pages with other documents and incorporate diverse typesetting techniques, posing challenges for automated processing. To address these issues, we evaluate various Optical Character Recognition (OCR) tools and document separation techniques, proposing an optimized pipeline that outperforms existing approaches. Our study identifies Mathpix as the most effective OCR tool for LaTeX conversion, demonstrating superior performance based on BLEU and Edit Distance metrics. For document separation, we fine-tune…
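The abstract compares OCR tools using Edit Distance, among other metrics. As a rough illustration of what such a score measures, here is a small sketch of character-level Levenshtein distance and a length-normalized variant. The normalization by the longer string and the function names are assumptions made for this example; the summary does not state which variant the authors use.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length, giving a value in [0, 1].

    This particular normalization is an assumption of this sketch.
    """
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    reference = r"\frac{a}{b}"   # hypothetical ground-truth LaTeX
    ocr_output = r"\frac{a}{c}"  # hypothetical OCR result
    print(edit_distance(reference, ocr_output))
    print(normalized_edit_distance(reference, ocr_output))
```

A lower score means the OCR output is closer to the reference transcription, so tools can be ranked by their average distance over a labelled sample.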
