Efficient and Reliable Estimation of Named Entity Linking Quality: A Case Study on GutBrainIE
Marco Martinelli, Stefano Marchesin, Gianmaria Silvello

TL;DR
This paper introduces a sampling-based framework for efficiently estimating the quality of Named Entity Linking in large biomedical corpora, achieving high accuracy with reduced expert annotation effort.
Contribution
The paper adapts Stratified Two-Stage Cluster Sampling (STWCS) to NEL accuracy estimation, yielding a scalable method that estimates corpus-level accuracy within a target margin of error at reduced annotation cost.
Findings
Achieved a margin of error (MoE) ≤ 0.05 with only 24.6% annotation effort.
Reduced expert annotation time by approximately 29% compared to the baseline.
Validated the framework on the GutBrainIE corpus, where it estimated NEL accuracy reliably.
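To make the margin-of-error target concrete, the sketch below computes the MoE of an accuracy estimate and the sample size needed to reach a target MoE under plain simple random sampling with a normal-approximation (Wald) interval. This is a reference point only, not the paper's STWCS procedure; the 95% confidence level (z = 1.96) and the function names `wald_moe` and `srs_sample_size` are assumptions made for illustration.

```python
import math

Z_95 = 1.96  # assumed 95% confidence level; the summary does not state one


def wald_moe(p_hat: float, n: int, z: float = Z_95) -> float:
    """Margin of error of a proportion estimated from n independent samples."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


def srs_sample_size(target_moe: float, p: float = 0.5, z: float = Z_95) -> int:
    """Smallest n whose Wald MoE does not exceed target_moe.

    p = 0.5 is the worst case, because p * (1 - p) is largest there.
    """
    return math.ceil(z * z * p * (1.0 - p) / (target_moe * target_moe))


if __name__ == "__main__":
    n = srs_sample_size(0.05)
    print(n, round(wald_moe(0.5, n), 4))
```

Under these assumptions the worst-case requirement is 385 annotated items, independent of corpus size. Cluster-based designs such as the one the paper adapts aim to lower the annotation cost per item rather than change this basic trade-off between sample size and MoE.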
Abstract
Named Entity Linking (NEL) is a core component of biomedical Information Extraction (IE) pipelines, yet assessing its quality at scale is challenging due to the high cost of expert annotations and the large size of corpora. In this paper, we present a sampling-based framework to estimate the NEL accuracy of large-scale IE corpora under statistical guarantees and constrained annotation budgets. We frame NEL accuracy estimation as a constrained optimization problem, where the objective is to minimize expected annotation cost subject to a target Margin of Error (MoE) for the corpus-level accuracy estimate. Building on recent works on knowledge graph accuracy estimation, we adapt Stratified Two-Stage Cluster Sampling (STWCS) to the NEL setting, defining label-based strata and global surface-form clusters in a way that is independent of NEL annotations. Applied to 11,184 NEL annotations in…
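The estimation scheme described in the abstract can be sketched as follows. This is a minimal illustration of a stratified two-stage cluster estimator, not the authors' implementation: it assumes clusters are drawn with probability proportional to size (with replacement), that a fixed number of items is then sampled within each drawn cluster, and that each stratum is weighted by its share of annotations. The names `estimate_stratum` and `estimate_accuracy`, the 0/1 correctness labels standing in for expert judgments, and the 95% confidence level are all hypothetical.

```python
import math
import random
from statistics import variance

Z_95 = 1.96  # assumed 95% confidence level


def estimate_stratum(clusters, n_clusters, m_per_cluster, rng):
    """Estimate accuracy within one stratum.

    clusters: list of clusters, each a non-empty list of 0/1 correctness
    labels (a stand-in for expert judgments on NEL annotations).
    """
    sizes = [len(c) for c in clusters]
    # Stage 1: draw clusters with probability proportional to size,
    # with replacement.
    drawn = rng.choices(clusters, weights=sizes, k=n_clusters)
    # Stage 2: simple random sample of at most m items per drawn cluster.
    cluster_means = []
    for cluster in drawn:
        picked = rng.sample(cluster, min(m_per_cluster, len(cluster)))
        cluster_means.append(sum(picked) / len(picked))
    estimate = sum(cluster_means) / n_clusters
    # Variance of the mean of the drawn cluster means.
    var = variance(cluster_means) / n_clusters if n_clusters > 1 else 0.0
    return estimate, var


def estimate_accuracy(strata, n_clusters, m_per_cluster, seed=0):
    """Combine per-stratum estimates into a corpus-level estimate and MoE.

    strata: dict mapping a stratum name (e.g. an entity label) to its clusters.
    """
    rng = random.Random(seed)
    total = sum(len(c) for clusters in strata.values() for c in clusters)
    acc, var = 0.0, 0.0
    for clusters in strata.values():
        weight = sum(len(c) for c in clusters) / total
        est_h, var_h = estimate_stratum(clusters, n_clusters, m_per_cluster, rng)
        acc += weight * est_h
        var += weight ** 2 * var_h
    return acc, Z_95 * math.sqrt(var)
```

In a budget-constrained setting, one would typically increase `n_clusters` in small batches and stop as soon as the returned MoE falls below the target (0.05 in the findings above), so that annotation cost is only spent until the guarantee is met.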
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Machine Learning in Healthcare
