# Annotating and indexing scientific articles with rare diseases

**Authors:** Hosein Azarbonyad, Zubair Afzal, Rik Iping, Max Dumoulin, Ilse Nederveen, Jiangtao Yu, Georgios Tsatsaronis

PMC · DOI: 10.1186/s13326-025-00346-1 · Journal of Biomedical Semantics · 2026-01-06

## TL;DR

This paper introduces a framework for automatically annotating and indexing scientific articles related to rare diseases using the OrphaNet taxonomy.

## Contribution

A novel scalable framework for rare disease literature annotation using OrphaNet, synonym expansion, and fuzzy matching.

## Key findings

- The framework achieves 92% precision, 75% recall, and 83% F1 on benchmark datasets.
- The pipeline generates disease-specific corpora for bibliometric and scientometric analyses.
- The Rare Diseases Monitor dashboard enables exploration of global research activity.
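
The reported F1 can be checked against the reported precision and recall, since F1 is their harmonic mean. A minimal sketch follows; the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the key findings above.
reported_precision = 0.92
reported_recall = 0.75

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")
```

Rounded to a whole percentage, the result agrees with the 83% stated above.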

## Abstract

Around 30 million people in Europe are affected by a rare (or orphan) disease, defined as a condition occurring in fewer than 1 in 2,000 individuals. The primary challenge is to automatically and efficiently identify scientific articles and guidelines that address a particular rare disease. We present a novel methodology to annotate and index scientific text with taxonomical concepts describing rare diseases from the OrphaNet taxonomy. This task is complicated by several technical challenges, including the lack of sufficiently large, human-annotated datasets for supervised training and the polysemy/synonymy and surface-form variation of rare disease names, which can hinder any annotation engine.

We introduce a framework that operationalizes OrphaNet for large-scale literature annotation by integrating the TERMite engine with curated synonym expansion, label normalization (including deprecated/renamed concepts), and fuzzy matching. On benchmark datasets, the approach achieves precision = 92%, recall = 75%, and F1 = 83%, outperforming a string-matching baseline. Applying the pipeline to Scopus produces disease-specific corpora suitable for bibliometric and scientometric analyses (e.g., institution, country, and subject-area profiles). These outputs power the Rare Diseases Monitor dashboard for exploring national and global research activity.
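
To make the combination of synonym expansion, label normalization, and fuzzy matching concrete, here is a minimal sketch using only the Python standard library. It is not the TERMite engine and not the authors' implementation: the toy taxonomy, the placeholder concept IDs (`C1`, `C2`, which are not real OrphaNet codes), the function names, and the 0.9 similarity threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical toy taxonomy: placeholder concept ID -> preferred label plus synonyms.
# These IDs are NOT real OrphaNet codes.
TOY_TAXONOMY = {
    "C1": ["cystic fibrosis", "mucoviscidosis"],
    "C2": ["Marfan syndrome", "Marfan's syndrome"],
}


def normalize(text: str) -> str:
    """Lowercase, drop possessive 's, and replace other punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"['’]s\b", "", text)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def build_index(taxonomy: dict) -> dict:
    """Synonym expansion: map every normalized surface form to its concept ID."""
    index = {}
    for concept_id, labels in taxonomy.items():
        for label in labels:
            index[normalize(label)] = concept_id
    return index


def annotate(text: str, index: dict, threshold: float = 0.9) -> set:
    """Return concept IDs whose surface forms fuzzily match a token window of the text."""
    tokens = normalize(text).split()
    found = set()
    for surface, concept_id in index.items():
        width = len(surface.split())
        for start in range(len(tokens) - width + 1):
            window = " ".join(tokens[start:start + width])
            # Character-level similarity tolerates small spelling variations.
            if SequenceMatcher(None, window, surface).ratio() >= threshold:
                found.add(concept_id)
                break
    return found


index = build_index(TOY_TAXONOMY)
print(annotate("Patients with cystic fibrosys were enrolled.", index))
print(annotate("A case report of Marfan's syndrome.", index))
```

In this sketch the threshold trades precision against recall: lowering it catches more spelling variants but also admits more false matches. A production system would additionally need to handle ambiguous names and deprecated labels, which the toy index ignores.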

To our knowledge, this is the first systematic, scalable semantic framework for annotating and indexing rare disease literature. By operationalizing OrphaNet in an automated, reproducible pipeline and addressing data scarcity and lexical variability, the work advances biomedical semantics for rare diseases and enables disease-centric monitoring, evaluation, and discovery across the research landscape.

## Linked entities

- **Diseases:** rare diseases (MONDO:0021200)

## Full-text entities

- **Diseases:** Rare Diseases (MESH:D035583)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12870340/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12870340/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12870340/full.md

---
Source: https://tomesphere.com/paper/PMC12870340