ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction
Nick Ferguson, Josh Pennington, Narek Beghian, Aravind Mohan, Douwe Kiela, Sheshansh Agrawal, Thien Hang Nguyen

TL;DR
ExtractBench is a benchmark and evaluation methodology for measuring how accurately and reliably large language models extract structured data from PDFs into JSON, particularly under complex schemas.
Contribution
It introduces an open-source benchmark and evaluation framework for PDF-to-JSON extraction that captures the semantics of nested extraction: field-specific notions of correctness, alignment of array items, and the distinction between omission and hallucination.
Findings
Frontier models perform poorly on complex schemas.
Model accuracy drops significantly as schema complexity increases.
No model achieved valid output on a 369-field financial schema.
Abstract
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and…
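The evaluation semantics named in the abstract (a different notion of correctness per field, array alignment, and omission versus hallucination) can be illustrated with a minimal Python sketch. This is not the ExtractBench implementation: every function name, the 1% tolerance, the 0.8 similarity threshold, and the key-based alignment are assumptions chosen for illustration, and plain string similarity stands in for semantic equivalence.

```python
from difflib import SequenceMatcher


def exact(gold, pred):
    """Exact match, e.g. for identifiers."""
    return gold == pred


def within_tolerance(gold, pred, rel_tol=0.01):
    """Numeric match within a relative tolerance (1% here is an assumption)."""
    return abs(gold - pred) <= rel_tol * abs(gold)


def similar_name(gold, pred, threshold=0.8):
    """Crude stand-in for semantic equivalence: normalized string similarity."""
    def norm(s):
        return " ".join(s.lower().replace(".", "").replace(",", "").split())
    return SequenceMatcher(None, norm(gold), norm(pred)).ratio() >= threshold


def score_field(gold, pred, compare):
    """Classify one field, keeping omission and hallucination distinct."""
    if gold is None and pred is None:
        return "correct_absent"
    if pred is None:
        return "omitted"        # value exists in the document, model left it out
    if gold is None:
        return "hallucinated"   # model produced a value with no gold counterpart
    return "correct" if compare(gold, pred) else "incorrect"


def align_items(gold_items, pred_items, key):
    """Pair array items by a shared key field (a simple alignment assumption).

    Returns (gold, pred) pairs; an unmatched side is None.
    """
    pred_by_key = {p.get(key): p for p in pred_items}
    matched = set()
    pairs = []
    for g in gold_items:
        p = pred_by_key.get(g.get(key))
        if p is not None:
            matched.add(g.get(key))
        pairs.append((g, p))
    # Predicted items with no gold counterpart are candidates for hallucination.
    for p in pred_items:
        if p.get(key) not in matched:
            pairs.append((None, p))
    return pairs


# Hypothetical gold and predicted values for a few fields.
results = {
    "invoice_id": score_field("INV-001", "INV-001", exact),
    "total": score_field(1000.0, 1004.0, within_tolerance),
    "vendor": score_field("Acme Corp.", "ACME Corp", similar_name),
    "due_date": score_field("2024-01-31", None, exact),
    "po_number": score_field(None, "PO-77", exact),
}

gold_lines = [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]
pred_lines = [{"sku": "B", "qty": 1}, {"sku": "C", "qty": 5}]
pairs = align_items(gold_lines, pred_lines, key="sku")
```

In this sketch a missing `due_date` counts as an omission and an invented `po_number` as a hallucination, so the two error types can be reported separately. Predicted array items are matched to gold items before any field is compared, so a reordered list is not penalized field by field.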
Taxonomy
Topics: Financial Reporting and XBRL · Handwritten Text Recognition Techniques · Advanced Text Analysis Techniques
