Evidence Units: Ontology-Grounded Document Organization for Parser-Independent Retrieval
Yeonjee Han

TL;DR
This paper introduces Evidence Units, semantically complete document chunks that improve parser-independent retrieval by grouping visual assets with contextual text, validated across multiple parsers and datasets.
Contribution
It presents an ontology-grounded schema, a global assignment algorithm, a graph-based validation layer, and cross-parser validation for constructing Evidence Units.
Findings
EU-based chunking improves retrieval LCS by 0.31.
Recall@1 increases from 0.15 to 0.51 with EU-based chunking.
Cross-parser results confirm the robustness of Evidence Units.
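For readers unfamiliar with the Recall@1 figure above, the following is a minimal sketch of how such a metric is commonly computed. It assumes one gold chunk per query, and the function name `recall_at_k` and the toy identifiers are hypothetical; the paper's exact evaluation protocol is not stated in this summary.

```python
def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries whose gold chunk appears in the top-k results.

    ranked_ids:   one ranked list of chunk ids per query (best first)
    relevant_ids: the single gold chunk id for each query (an assumption)
    """
    if not ranked_ids:
        return 0.0
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        # A query counts as a hit if its gold chunk is among the first k results.
        if relevant in ranked[:k]:
            hits += 1
    return hits / len(ranked_ids)


# Toy data: four queries, two retrieved chunks each.
ranked = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
gold = ["a", "d", "e", "x"]

r1 = recall_at_k(ranked, gold, k=1)
r2 = recall_at_k(ranked, gold, k=2)
```

Under this definition, a Recall@1 of 0.51 would mean that the top-ranked chunk is the gold chunk for roughly half of the queries.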
Abstract
Structured documents (tables paired with captions, figures with explanations, equations with the paragraphs that interpret them) are routinely fragmented when indexed for retrieval. Element-level indexing treats every parsed element as an independent chunk, scattering semantically cohesive units across separate retrieval candidates. This paper presents a parser-independent pipeline that constructs Evidence Units (EUs): semantically complete document chunks that group visual assets with their contextual text. We introduce four contributions: (1) ontology-grounded role normalization extending DoCO that maps heterogeneous parser outputs to a unified semantic schema; (2) a semantic global assignment algorithm that optimally assigns paragraphs to EUs via a full similarity matrix; (3) a graph-based decision layer in Neo4j that formalizes EU construction rules and validates completeness…
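The idea of assigning paragraphs to EUs through a full similarity matrix can be illustrated with a short sketch. This is not the paper's algorithm, whose objective and constraints are not given in this summary. It assumes each EU is anchored by one visual asset, that paragraphs and assets already have embedding vectors, and that there are no capacity limits on an EU, in which case picking the best-scoring asset per paragraph maximizes total similarity. The names `assign_paragraphs` and `min_sim` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_paragraphs(paragraph_vecs, asset_vecs, min_sim=0.5):
    """Assign each paragraph to its most similar asset-anchored unit.

    Returns (units, unassigned): units maps asset index -> paragraph indices,
    unassigned lists paragraphs whose best similarity is below min_sim.
    """
    # Full paragraph-by-asset similarity matrix, computed once up front.
    sim = [[cosine(p, a) for a in asset_vecs] for p in paragraph_vecs]

    units = {j: [] for j in range(len(asset_vecs))}
    unassigned = []
    for i, row in enumerate(sim):
        if not row:
            unassigned.append(i)
            continue
        best = max(range(len(row)), key=row.__getitem__)
        # min_sim is an assumed threshold that keeps unrelated text out of a unit.
        if row[best] >= min_sim:
            units[best].append(i)
        else:
            unassigned.append(i)
    return units, unassigned


# Toy 2-D embeddings: two assets (say, a table and a figure), four paragraphs.
assets = [[1.0, 0.0], [0.0, 1.0]]
paragraphs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, -1.0]]
units, unassigned = assign_paragraphs(paragraphs, assets)
```

Computing the whole matrix before assigning anything is what distinguishes this from a greedy nearest-neighbor pass over the page. If EUs had size limits or one-to-one constraints, a matching solver would be needed instead of the per-row maximum used here.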
