READoc: A Unified Benchmark for Realistic Document Structured Extraction
Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, Le Sun

TL;DR
READoc is a benchmark for realistic document structured extraction (DSE). It frames the task as converting unstructured PDFs into semantically rich Markdown, which allows diverse DSE systems to be evaluated in a unified way and exposes their current gaps.
Contribution
We present READoc, a new dataset and evaluation suite for DSE, addressing fragmented benchmarks and enabling holistic assessment of extraction methods from real-world documents.
Findings
Current DSE methods show significant performance gaps.
Unified evaluation reveals strengths and weaknesses of existing approaches.
READoc facilitates future research and development in practical DSE solutions.
Abstract
Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field's advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To address these limitations and offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 3,576 diverse and real-world documents from arXiv, GitHub, and Zenodo. In addition, we develop a DSE Evaluation Suite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of…
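The abstract names three modules (Standardization, Segmentation, Scoring) but does not say how each one works. The following is a minimal Python sketch of how such a pipeline could be wired together, not the authors' implementation. The function names, the split into "headings" and "body" views, and the use of `difflib` character similarity as the score are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def standardize(markdown: str) -> str:
    """Normalize line endings, trailing whitespace, and runs of blank lines."""
    text = markdown.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def segment(markdown: str) -> dict:
    """Split Markdown into two hypothetical views: headings and body text."""
    headings, body = [], []
    for line in markdown.split("\n"):
        if re.match(r"^#{1,6}\s+\S", line):
            # ATX heading: keep only the heading text.
            headings.append(line.lstrip("#").strip())
        elif line.strip():
            body.append(line.strip())
    return {"headings": "\n".join(headings), "body": "\n".join(body)}


def score(predicted: str, reference: str) -> dict:
    """Score each view with a character-level similarity ratio in [0, 1]."""
    pred = segment(standardize(predicted))
    ref = segment(standardize(reference))
    return {
        name: SequenceMatcher(None, pred[name], ref[name], autojunk=False).ratio()
        for name in ref
    }


if __name__ == "__main__":
    reference = "# Title\n\nSome text.\n"
    predicted = "# Titel\r\n\r\n\r\nSome text.   \r\n"
    print(score(predicted, reference))
```

Scoring each view separately means a system that recovers the body text but misreads a heading is penalized only on the heading view, which is one way a unified suite could still report where a system fails.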
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
