Key Coverage Matters: Semi-Structured Extraction of OCR Clinical Reports

Yu Wang; Yingyun Li; Ying Qin; Haiyang Qian

arXiv:2605.09440 · cs.CL · May 12, 2026

TL;DR

This paper presents a semi-structured extraction method for OCR clinical reports that improves information retrieval by focusing on key coverage, using a BERT-based model and iterative key mining across diverse hospitals.

Contribution

The authors introduce a novel key-conditioned extractive question answering framework with an open key space, emphasizing key coverage for improved extraction from noisy OCR clinical reports.
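To make the formulation concrete, here is a minimal sketch of how a key-conditioned extractive QA example might be represented: the key plays the role of the question, the OCR text is the context, and the answer is a character span inside that context. The names `QAExample`, `build_example`, and `extract_value` are hypothetical, and the sample report text is invented for illustration; none of this is taken from the paper, whose actual data format and model interface are not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    """One (key, context) pair with an optional answer span (hypothetical format)."""
    key: str                     # key used as the question
    context: str                 # OCR-derived report text
    answer_start: Optional[int]  # character offset of the value, None if absent
    answer_end: Optional[int]    # exclusive end offset, None if absent


def build_example(report_text: str, key: str, value: Optional[str]) -> QAExample:
    """Locate `value` in the report and record its span; mark it absent otherwise."""
    if value is None:
        return QAExample(key, report_text, None, None)
    start = report_text.find(value)
    if start == -1:
        # The value does not occur verbatim (for example, because of OCR noise).
        return QAExample(key, report_text, None, None)
    return QAExample(key, report_text, start, start + len(value))


def extract_value(example: QAExample) -> str:
    """Return the answer text for a span, or an empty string if the key is unanswerable."""
    if example.answer_start is None or example.answer_end is None:
        return ""
    return example.context[example.answer_start:example.answer_end]


# Usage with an invented two-line report.
report = "Hemoglobin 13.2 g/dL\nWBC 6.1 10^9/L"
found = build_example(report, "Hemoglobin", "13.2 g/dL")
missing = build_example(report, "Platelets", None)
print(extract_value(found))
print(repr(extract_value(missing)))
```

Because the answer is a span rather than a free-text generation, every extracted value can be traced back to a position in the OCR text. With an open key space, any key can be posed as a question, and a key that the report does not contain simply yields no span.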

Findings

01

Performance improves monotonically with key coverage.

02

Achieves F1 scores of 0.839 (exact match) and 0.893 (boundary-tolerant) at Top-90 key coverage.

03

Outperforms baseline models once key coverage reaches 90%.

Abstract

Clinical reports are often fragmented across healthcare institutions because privacy regulations and data silos limit direct information sharing. When patients seek care at a different hospital, they often carry paper or scanned reports from prior visits. This hinders EHR integration and longitudinal review, and downstream applications that depend on more complete patient records, such as patient management, follow-up care, real-world studies, and clinical-trial matching. Although OCR can digitize such reports, reliable extraction remains challenging because clinical documents are heterogeneous, OCR text is noisy, and many healthcare settings require low-cost on-premise deployment. We formulate this problem as canonical key-conditioned extractive question answering over OCR-derived clinical reports. Because the key fields are neither fixed nor known in advance, the key space is open. We…
