Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
Saurabh K. Singh, Sachin Raj

TL;DR
This paper introduces EnterpriseDocBench, a unified evaluation framework for enterprise multimodal document processing pipelines, assessing parsing, retrieval, and generation quality on a common corpus.
Contribution
It presents a benchmarking approach, including a new corpus and evaluation metrics, for assessing enterprise document AI pipelines end to end rather than stage by stage.
Findings
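Hybrid retrieval narrowly outperforms BM25 on relevance (nDCG@5 of 0.92 vs. 0.91), and both outperform dense embedding (0.83).
Hallucination rates do not grow monotonically with document length: short and very long documents both hallucinate more than medium-length ones (28.1% and 23.8%).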
Weak correlations between pipeline stages suggest limited cascading effects.
Abstract
Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what's still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it -- BM25, dense embedding, and a hybrid -- all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn't grow monotonically with document length -- short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs.…
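The retrieval numbers above are reported as nDCG@5. As a minimal sketch of how such a score can be computed for a single query, the snippet below uses the common linear-gain form of DCG with a log2 rank discount. The function names `dcg_at_k` and `ndcg_at_k` are illustrative, not taken from the paper, and the sketch assumes the ranked list already contains every judged document for the query, so the ideal ordering is just the same labels sorted in descending order. The paper may use a different gain function or judgment pool.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevance labels.

    Uses linear gain (rel) and a log2(rank + 1) discount with 1-based ranks.
    """
    return sum(
        rel / math.log2(position + 2)  # position is 0-based, so rank + 1 == position + 2
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """Normalized DCG: DCG of the system ranking divided by DCG of the ideal ranking.

    Assumption: `relevances` holds the labels of all judged documents in the
    order the retriever returned them, so sorting gives the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Hypothetical graded labels for one query, in retrieved order.
    ranked_labels = [3, 0, 2, 1, 0, 0]
    print(f"nDCG@5 = {ndcg_at_k(ranked_labels, k=5):.3f}")
```

A benchmark-level figure such as the ones quoted above would typically be the mean of this per-query score over all queries, though the paper's exact aggregation is not stated here.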
