DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
Qintong Zhang, Junyuan Zhang, Zhifei Ren, Linke Ouyang, Zichen Wen, Junbo Niu, Yuan Qu, Bin Wang, Ka-Ho Chow, Conghui He, Wentao Zhang

TL;DR
DOCR-Inspector is a fine-grained, automated evaluation framework for document parsing. It detects specific error types in parser output and assesses overall quality; its 7B model outperforms commercial and open-source models on real-world cases, and its assessments can guide parsing refinement.
Contribution
This work presents a hierarchical error detection approach and a large annotated benchmark for comprehensive evaluation of document parsing with vision language models.
Findings
DOCR-Inspector-7B outperforms commercial and open-source models on real-world cases.
The hierarchical Chain-of-Checklist reasoning improves error detection accuracy.
Quality assessments from DOCR-Inspector guide parsing refinement effectively.
Abstract
Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained…
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques
