Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs
Pius Horn, Janis Keuper

TL;DR
This paper introduces a benchmarking framework for evaluating PDF document parsers on mathematical formula extraction, using synthetic PDFs with LaTeX ground truth and LLM-based semantic evaluation.
Contribution
It presents a novel benchmarking approach with a synthetic dataset, LLM-based semantic evaluation, and a comprehensive comparison of over 20 parsers.
Findings
LLM-based evaluation correlates highly with human judgment (r=0.78).
Significant performance disparities exist among different PDF parsers.
The framework provides actionable insights for parser selection in scientific applications.
Abstract
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. For evaluation, we apply LLM-as-a-judge to assess semantic equivalence of parsed formulas, capturing mathematical meaning beyond surface-level notation differences. We validate this approach through a human study (250 formula pairs, 750 ratings from 30 evaluators), showing a Pearson correlation of r=0.78 with human judgment, compared to r=0.34 for Character Detection Matching (CDM) and r≈0 for text similarity. Our robust two-stage matching…
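The validation step described in the abstract compares an automatic metric against human judgment using a Pearson correlation. As a minimal sketch of that kind of computation (not the authors' code), the snippet below averages the human ratings for each formula pair and correlates the averages with a judge score. The function name `pearson_r`, the rating scales, and all numbers are hypothetical toy values, not data from the paper. Three ratings per pair is an assumption consistent with 750 ratings over 250 pairs.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the paper): three human ratings per formula
# pair on an assumed 1-5 scale, and one judge score per pair on an assumed
# 0-10 scale.
human_ratings = [[5, 4, 5], [1, 2, 1], [3, 3, 4], [5, 5, 4], [2, 1, 2]]
judge_scores = [9, 2, 6, 10, 3]

# Collapse the ratings for each pair into a single mean human score.
human_means = [mean(ratings) for ratings in human_ratings]

r = pearson_r(human_means, judge_scores)
print(f"Pearson r = {r:.2f}")
```

The same function could be applied to any other metric's per-pair scores, which is how figures for different metrics become comparable on one scale.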
