# Gold standard, multi-genre dataset for named entity recognition and linking

**Authors:** Szymon Olewniczak, Julian Szymański

PMC · DOI: 10.1038/s41597-025-05274-4 · Scientific Data · 2025-06-13

## TL;DR

This paper introduces a high-quality, multi-genre dataset for evaluating entity-linking systems using Wikipedia as a knowledge base.

## Contribution

The novelty lies in combining texts from diverse domains with an entity-type annotation on every mention, in a single dataset built for entity-linking evaluation.

## Key findings

- The dataset covers texts from multiple domains, whereas most current datasets focus on a single domain such as news.
- Each identified text segment (mention) is annotated with its corresponding entity type, which improves the dataset's usefulness and reliability.
- The dataset is publicly available for download at DOI 10.34808/f3q9-9k64.

## Abstract

In our study, we introduce a new dataset designed for the evaluation of entity-linking systems. Entity Linking (EL) involves identifying specific segments in a text, so-called mentions, and linking them to relevant entries in an external Knowledge Base (KB). EL is a challenging task with numerous complexities, making it vital to have access to high-quality data for testing. Our dataset is unique in that it encompasses texts from various domains, in contrast to most current datasets, which focus on a single domain such as news. Furthermore, we have annotated each identified text segment with its corresponding entity type, enhancing the dataset’s usefulness and reliability. This dataset employs Wikipedia as its Knowledge Base, which is the prevalent choice for general-domain entity-linking systems. The dataset is available for download at DOI 10.34808/f3q9-9k64.
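To make the task definition concrete, the following is a minimal sketch of how a gold annotation (mention span, entity type, Wikipedia entry) might be represented and how a system's output could be scored against it. The `Mention` fields, the type labels, the example sentence, and the strict-match scoring are illustrative assumptions, not the dataset's actual file format or the authors' evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One annotated mention (hypothetical schema, not the dataset's format)."""
    start: int        # character offset where the mention begins (inclusive)
    end: int          # character offset where the mention ends (exclusive)
    entity_type: str  # assumed type label, e.g. "PERSON"
    kb_title: str     # assumed Wikipedia page title used as the KB entry


def micro_prf(gold: set[Mention], predicted: set[Mention]) -> tuple[float, float, float]:
    """Strict-match precision, recall, and F1: a prediction counts only if
    span, entity type, and KB entry all equal a gold annotation."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative sentence and annotations (not taken from the dataset).
text = "Marie Curie was born in Warsaw."
gold = {
    Mention(0, 11, "PERSON", "Marie_Curie"),
    Mention(24, 30, "LOCATION", "Warsaw"),
}
# A system that finds both spans but links the second to the wrong page.
predicted = {
    Mention(0, 11, "PERSON", "Marie_Curie"),
    Mention(24, 30, "LOCATION", "Warsaw,_Indiana"),
}

precision, recall, f1 = micro_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Strict matching is only one possible choice; an evaluation could instead score mention detection, typing, and linking separately, which is where per-mention entity types become useful.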

## Full-text entities

- **Genes:** AP2B1 (adaptor related protein complex 2 subunit beta 1) [NCBI Gene 163] {aka ADTB2, AP105B, AP2-BETA, CLAPB1}, AIDA (axin interactor, dorsalization associated) [NCBI Gene 64853] {aka C1orf80}, AGRP (agouti related neuropeptide) [NCBI Gene 181] {aka AGRT, ART, ASIP2}, SPI1 (Spi-1 proto-oncogene) [NCBI Gene 6688] {aka AGM10, OF, PU.1, SFPI1, SPI-1, SPI-A}
- **Diseases:** EVENT (MESH:D002318), DISEASE (MESH:D004194), KB (MESH:D019292), World War II (MESH:D000067398), EL (MESH:C536424), SPECIE (MESH:C564159), SUBSTANCE (MESH:D019966)
- **Chemicals:** Nitazoxanide (MESH:C041747), gold (MESH:D006046), AQUAINT (-)
- **Species:** Escherichia coli (E. coli, species) [taxon 562], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12166075/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12166075/full.md

## References

21 references — full list in the complete paper: https://tomesphere.com/paper/PMC12166075/full.md

---
Source: https://tomesphere.com/paper/PMC12166075