Evidence Units: Ontology-Grounded Document Organization for Parser-Independent Retrieval

Yeonjee Han

arXiv:2604.00500·cs.IR·April 2, 2026

TL;DR

This paper introduces Evidence Units, semantically complete document chunks that improve parser-independent retrieval by grouping visual assets with contextual text, validated across multiple parsers and datasets.

Contribution

The paper contributes four components for constructing Evidence Units: an ontology-grounded role-normalization schema that extends DoCO, a semantic global assignment algorithm, a graph-based decision and validation layer in Neo4j, and cross-parser validation.

Findings

01

EU-based chunking raises retrieval LCS by 0.31.

02

Recall@1 increases from 0.15 to 0.51 with EU-based chunking.

03

Cross-parser results confirm the robustness of Evidence Units.
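
For readers unfamiliar with the Recall@1 metric cited in the findings, here is a minimal sketch of how such a score is commonly computed. It assumes each query has a set of gold chunk IDs and the retriever returns a ranked list of chunk IDs. The function name `recall_at_k` and the toy IDs are illustrative and not taken from the paper, whose exact evaluation protocol is not described on this page.

```python
def recall_at_k(ranked_ids_per_query, relevant_ids_per_query, k=1):
    """Fraction of queries with at least one relevant chunk in the top k results.

    This is the hit-rate form of Recall@k, which coincides with recall when
    each query has a single gold chunk (an assumption of this sketch).
    """
    if not ranked_ids_per_query:
        return 0.0
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        # A query counts as a hit if any of its top-k chunks is relevant.
        if any(chunk_id in relevant for chunk_id in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids_per_query)


# Hypothetical toy data: four queries, each with one gold chunk.
ranked = [["c1", "c7"], ["c4", "c2"], ["c3", "c9"], ["c8", "c5"]]
relevant = [{"c1"}, {"c2"}, {"c3"}, {"c5"}]

score_at_1 = recall_at_k(ranked, relevant, k=1)  # queries 1 and 3 hit at rank 1
score_at_2 = recall_at_k(ranked, relevant, k=2)  # all four hit within rank 2
```

Under this definition, a move from 0.15 to 0.51 means the top-ranked chunk is relevant for roughly half of the queries instead of about one in seven.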

Abstract

Structured documents--tables paired with captions, figures with explanations, equations with the paragraphs that interpret them--are routinely fragmented when indexed for retrieval. Element-level indexing treats every parsed element as an independent chunk, scattering semantically cohesive units across separate retrieval candidates. This paper presents a parser-independent pipeline that constructs Evidence Units (EUs): semantically complete document chunks that group visual assets with their contextual text. We introduce four contributions: (1) ontology-grounded role normalization extending DoCO that maps heterogeneous parser outputs to a unified semantic schema; (2) a semantic global assignment algorithm that optimally assigns paragraphs to EUs via a full similarity matrix; (3) a graph-based decision layer in Neo4j that formalizes EU construction rules and validates completeness…
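
The abstract's second contribution, assigning paragraphs to EUs through a full similarity matrix, can be illustrated with a small sketch. This is not the author's algorithm: the abstract does not specify the optimization, so the version below uses a per-paragraph argmax with a similarity threshold as a stand-in. The vectors, the `min_sim` threshold, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(paragraph_vecs, asset_vecs):
    """Full matrix: one row per paragraph, one column per visual asset (EU anchor)."""
    return [[cosine(p, a) for a in asset_vecs] for p in paragraph_vecs]


def assign_paragraphs(sim, min_sim=0.5):
    """Map each paragraph index to its most similar asset index, or None.

    Every paragraph is compared against every asset, not only adjacent ones.
    Paragraphs below the (assumed) threshold stay unassigned.
    """
    assignment = {}
    for i, row in enumerate(sim):
        if not row:
            assignment[i] = None
            continue
        best = max(range(len(row)), key=row.__getitem__)
        assignment[i] = best if row[best] >= min_sim else None
    return assignment


# Hypothetical toy embeddings: four paragraphs, two assets (e.g. a table and a figure).
paragraphs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.1, 1.0], [0.9, 0.1, 0.0]]
assets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

sim = similarity_matrix(paragraphs, assets)
assignment = assign_paragraphs(sim)
```

In this toy run, paragraphs 0 and 3 attach to the first asset, paragraph 1 to the second, and paragraph 2 stays unassigned because it resembles neither. A real pipeline would take embeddings from a sentence encoder and might add layout or ordering constraints, which the abstract does not detail.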
