ParseBench: A Document Parsing Benchmark for AI Agents
Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Daniel B. Ospina, Simon Suo

TL;DR
ParseBench is a benchmark of around 2,000 human-verified enterprise document pages that evaluates document parsing for AI agents on semantic correctness across five capability dimensions.
Contribution
The paper introduces a benchmark covering five parsing dimensions (tables, charts, content faithfulness, semantic formatting, and visual grounding) and evaluates 14 methods on it, revealing capability gaps and identifying the highest-performing system.
Findings
No method excels across all five dimensions.
LlamaParse Agentic achieves the highest overall score, 84.9%.
The benchmark exposes capability gaps in current document parsing systems.
Abstract
AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented…
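The abstract's point that text-similarity metrics can miss agent-critical failures can be illustrated with a minimal sketch. This is not ParseBench's scoring method: the table contents, the `text_similarity` and `cell_accuracy` helpers, and the use of `difflib` as the similarity measure are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher

def flatten(table):
    """Join a table (list of rows) into plain text, one row per line."""
    return "\n".join(" ".join(row) for row in table)

def text_similarity(a, b):
    """Character-level similarity ratio in [0, 1] (illustrative stand-in metric)."""
    return SequenceMatcher(None, a, b).ratio()

def cell_accuracy(pred, truth):
    """Fraction of ground-truth cells reproduced at the same (row, column)."""
    total = sum(len(row) for row in truth)
    correct = 0
    for i, row in enumerate(truth):
        for j, cell in enumerate(row):
            if i < len(pred) and j < len(pred[i]) and pred[i][j] == cell:
                correct += 1
    return correct / total

# Hypothetical ground-truth table (invented values, not from the benchmark).
truth = [
    ["Policy", "Premium", "Deductible"],
    ["A-100", "1200", "500"],
    ["B-200", "950", "1000"],
]

# Hypothetical parser output: every value landed in the wrong column.
pred = [
    ["Policy", "Premium", "Deductible"],
    ["A-100", "500", "1200"],
    ["B-200", "1000", "950"],
]

similarity = text_similarity(flatten(pred), flatten(truth))
accuracy = cell_accuracy(pred, truth)
print(f"text similarity: {similarity:.2f}")
print(f"cell accuracy:   {accuracy:.2f}")
```

In this toy case the flattened text stays highly similar because the same tokens appear in nearly the same order, yet none of the four numeric cells sits under the correct header. An agent reading the premium for policy A-100 would act on the wrong number, which is the kind of failure a structure-aware check is meant to catch.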
