ParseBench: A Document Parsing Benchmark for AI Agents

Boyang Zhang; Sebastián G. Acosta; Preston Carlson; Sacha Bron; Pierre-Loïc Doulcet; Daniel B. Ospina; Simon Suo

arXiv:2604.08538·cs.CV·April 14, 2026

TL;DR

ParseBench is a benchmark of about 2,000 human-verified enterprise document pages that evaluates whether document parsers preserve the semantic correctness AI agents need, across five parsing capabilities.

Contribution

The paper introduces a benchmark covering five parsing dimensions (tables, charts, content faithfulness, semantic formatting, and visual grounding) and evaluates 14 methods on it, revealing capability gaps and identifying the highest-performing system.

Findings

01. No method excels across all five dimensions.

02. LlamaParse Agentic achieves an overall score of 84.9%.

03. The benchmark exposes gaps in current document parsing systems.

Abstract

AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of $\sim 2{,}000$ human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented…
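The abstract's point that text-similarity metrics can miss agent-critical failures can be illustrated with a small sketch. This is not ParseBench's metric or code: the functions `text_similarity` and `cell_accuracy`, the toy insurance table, and the swapped-column parser output are all hypothetical, chosen only to show how a flattened-text score can stay high while the table structure an agent relies on is wrong.

```python
from difflib import SequenceMatcher


def flatten(table):
    """Join all cells into one string, discarding row/column structure."""
    return " ".join(cell for row in table for cell in row)


def text_similarity(ref, pred):
    """Character-level similarity of the flattened tables (0.0 to 1.0)."""
    return SequenceMatcher(None, flatten(ref), flatten(pred)).ratio()


def cell_accuracy(ref, pred):
    """Fraction of reference cells whose value appears at the same position."""
    total = sum(len(row) for row in ref)
    correct = 0
    for r, row in enumerate(ref):
        for c, cell in enumerate(row):
            if r < len(pred) and c < len(pred[r]) and pred[r][c] == cell:
                correct += 1
    return correct / total


# Hypothetical ground-truth table (illustrative data, not from the benchmark).
reference = [
    ["Policy", "Premium", "Deductible"],
    ["A-100", "1200", "500"],
    ["B-200", "1500", "250"],
]

# Hypothetical parser output: every token is present, but the two numeric
# columns are swapped in the data rows.
predicted = [
    ["Policy", "Premium", "Deductible"],
    ["A-100", "500", "1200"],
    ["B-200", "250", "1500"],
]

if __name__ == "__main__":
    print(f"text similarity: {text_similarity(reference, predicted):.2f}")
    print(f"cell accuracy:   {cell_accuracy(reference, predicted):.2f}")
```

In this toy case only 5 of 9 cells sit in the right position, yet the flattened text remains largely similar, so an agent reading the premium column would act on wrong values despite a high text-similarity score.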

Code & Models

Repositories

run-llama/ParseBench (GitHub)
