# Generative Models and Sentence Transformers for the Recognition and Normalization of Continuous and Discontinuous Phenotype Mentions: Model Development and Evaluation

**Authors:** Areej Alhassan, Viktor Schlegel, Monira Aloud, Riza Batista-Navarro, Goran Nenadic

PMC · DOI: 10.2196/68558 · 2025-11-05

## TL;DR

This paper presents DiscHPO, a system that extracts genetic phenotype mentions from clinical reports and normalizes them to Human Phenotype Ontology concepts, with particular attention to discontinuous mentions.

## Contribution

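The main contribution is DiscHPO, a 2-phase pipeline. A generative sequence-to-sequence named entity recognition model extracts both continuous and discontinuous phenotype mentions, and a sentence transformer biencoder with a cross-encoder reranker normalizes them to ontology concepts.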

## Key findings

- On the test set, the best models achieved F1-scores of 0.723 for entity normalization and 0.665 for span extraction, both surpassing 2 baseline models evaluated on the same dataset.
- On the validation set, the model reached an exact-match F1-score of 0.631 on discontinuous spans alone.
- Partial mention matches can be sufficient for successful normalization, supporting the system's utility in clinical tasks.

## Abstract

Extracting genetic phenotype mentions from clinical reports and normalizing them to standardized concepts within the Human Phenotype Ontology are essential for consistent interpretation and representation of genetic conditions. This is particularly important in fields such as dysmorphology and plays a key role in advancing personalized health care. However, modern clinical named entity recognition methods struggle to accurately identify discontinuous mentions (ie, entity spans that are interrupted by unrelated words), which occur in these clinical reports.

This study aims to develop a system that can accurately extract and normalize genetic phenotypes, specifically from physical examination reports related to dysmorphology assessment. Phenotype mentions in these reports appear in both continuous and discontinuous lexical forms, and the study focuses on the more challenging discontinuous entity spans.
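To make the notion of a discontinuous mention concrete, the sketch below represents a mention as an ordered tuple of character segments. The `Mention` class, the `find_segment` helper, and the example sentence are all hypothetical illustrations and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    # Each segment is a (start, end) pair of character offsets, end exclusive.
    segments: tuple

    @property
    def is_discontinuous(self) -> bool:
        # More than one segment means unrelated words interrupt the mention.
        return len(self.segments) > 1

    def text(self, source: str) -> str:
        # Join the segment texts, skipping whatever lies between them.
        return " ".join(source[start:end] for start, end in self.segments)


def find_segment(source: str, phrase: str) -> tuple:
    # Hypothetical helper: offsets of the first occurrence of `phrase`.
    start = source.index(phrase)
    return (start, start + len(phrase))


# Invented example sentence, used only for illustration.
sentence = "The ears are low-set and posteriorly rotated."

low_set = Mention((find_segment(sentence, "low-set"),))
rotated = Mention((
    find_segment(sentence, "ears"),
    find_segment(sentence, "posteriorly rotated"),
))
```

Here `rotated` covers "ears" and "posteriorly rotated" while skipping "are low-set and", which is the kind of interrupted span the abstract describes; `low_set` is an ordinary continuous mention.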

We introduce DiscHPO, a 2-phase pipeline consisting of a sequence-to-sequence named entity recognition model for span extraction, and an entity normalizer that uses a sentence transformer biencoder for candidate generation and a cross-encoder reranker for selecting the best candidate as the normalized concept. This system was tested as part of our participation in Track 3 of the BioCreative VIII shared task.
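The normalization phase follows a retrieve-then-rerank pattern, which can be sketched with the standard library alone. Everything below is an assumption made for illustration: the character-trigram `embed` function stands in for the sentence transformer biencoder, the token-overlap `pair_score` stands in for the cross-encoder, and the concept IDs are placeholders rather than real ontology identifiers. It shows only the control flow, not the authors' models.

```python
import math
from collections import Counter

# Placeholder ontology: the IDs are made up and are NOT real HPO identifiers.
ONTOLOGY = {
    "C1": "low-set ears",
    "C2": "posteriorly rotated ears",
    "C3": "widely spaced eyes",
    "C4": "long fingers",
}


def embed(text):
    """Toy stand-in for a biencoder: encode text alone as trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_candidates(mention, k=2):
    """Phase 2a: rank all concepts by embedding similarity, keep the top k."""
    query = embed(mention)
    ranked = sorted(
        ONTOLOGY,
        key=lambda cid: cosine(query, embed(ONTOLOGY[cid])),
        reverse=True,
    )
    return ranked[:k]


def pair_score(mention, label):
    """Toy stand-in for a cross-encoder: score the (mention, label) pair jointly."""
    a, b = set(mention.lower().split()), set(label.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def normalize(mention, k=2):
    """Phase 2b: rerank the retrieved candidates and return the best concept ID."""
    candidates = generate_candidates(mention, k)
    return max(candidates, key=lambda cid: pair_score(mention, ONTOLOGY[cid]))
```

The split matters because the first stage compares independently computed embeddings and can therefore scan the whole ontology cheaply, while the second stage inspects each mention and candidate together and is applied only to the short list.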

For overall performance on the test set, the top-performing model for entity normalization achieved an F1-score of 0.723, while the best span extraction model reached an F1-score of 0.665. Both scores surpassed those of 2 baseline models using the same dataset, indicating superior efficacy in handling both continuous and discontinuous spans. On the validation set, we demonstrated our system’s ability to recognize discontinuous mentions, with the model achieving an F1-score of 0.631 for exact match on discontinuous spans only.
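As a reminder of what an exact-match F1-score measures, here is a minimal sketch that scores predicted spans against gold spans, where a prediction counts only if every segment boundary matches. The span encoding (tuples of `(start, end)` segments), the function names, and the toy offsets are assumptions; the shared task's official scorer may apply different matching rules.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores: a predicted span is correct only if it equals a gold span."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def discontinuous_only(spans):
    """Keep spans made of more than one segment."""
    return {span for span in spans if len(span) > 1}


# Toy spans: each span is a tuple of (start, end) segments; offsets are invented.
gold_spans = {((4, 8), (25, 44)), ((13, 20),)}
predicted_spans = {((4, 8), (25, 44)), ((13, 16),)}

overall = precision_recall_f1(gold_spans, predicted_spans)
discontinuous = precision_recall_f1(
    discontinuous_only(gold_spans), discontinuous_only(predicted_spans)
)
```

In this toy case the truncated continuous span `((13, 16),)` is counted as an error under exact matching, so the overall F1 is 0.5, while the discontinuous-only F1 is 1.0.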

The findings suggest that exact extraction of entity spans may not always be necessary for successful normalization. Partial mention matches can be sufficient as long as they capture the essential concept information, supporting the system’s utility in clinical downstream tasks.
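One simple way to quantify how "partial" a predicted mention is would be a character-level overlap ratio between the predicted and gold segments. This is an illustrative measure chosen for this sketch, not a metric reported in the paper, and the offsets are invented.

```python
def char_positions(segments):
    """Expand (start, end) segments, end exclusive, into a set of character indices."""
    return {i for start, end in segments for i in range(start, end)}


def overlap_ratio(gold_segments, predicted_segments):
    """Jaccard overlap between the characters covered by two mentions."""
    gold = char_positions(gold_segments)
    predicted = char_positions(predicted_segments)
    union = gold | predicted
    return len(gold & predicted) / len(union) if union else 0.0


# The prediction recovers only the second segment of a two-segment gold mention.
gold_mention = ((4, 8), (25, 44))
partial_prediction = ((25, 44),)
ratio = overlap_ratio(gold_mention, partial_prediction)
```

A prediction like this fails exact matching, yet it may still carry enough of the concept's wording for a normalizer to select the right ontology entry, which is the situation the paragraph above describes.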

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12631088/full.md

---
Source: https://tomesphere.com/paper/PMC12631088