# Comprehensive representation of health-related phenotypes in one million dogs using topic modelling of electronic health records

**Authors:** Peter-John Mäntylä Noble, Sean Oliver Farrell, Noura Al-Moubayed, Alan David Radford

PMC · DOI: 10.1186/s40537-026-01365-0 · 2026-02-24

## TL;DR

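This paper applies BERTopic, an unsupervised topic-modelling method, to veterinary clinical notes collected by SAVSNET from one million dogs, surfacing known breed predispositions as well as potential novel disease patterns.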

## Contribution

A novel application of BERTopic for extracting health-related phenotypes from veterinary clinical notes at scale.

## Key findings

- BERTopic surfaced known breed predispositions to hypoadrenocorticism, diabetes mellitus and mitral valve disease.
- The method revealed potential novel patterns in disease phenotypes across a large population of dogs.
- The approach enables rapid and scalable interrogation of clinical datasets for diverse health insights.
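
To make the idea of screening topics for breed predispositions concrete, here is a minimal sketch, not the authors' method. It assumes each clinical note has already been assigned a topic by an upstream topic model, and it compares how often one breed falls in a topic against all other breeds using an odds ratio. The function name `breed_topic_odds_ratios`, the breed and topic labels, and the counts are all hypothetical; the 0.5 continuity correction is one common choice for avoiding division by zero.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def breed_topic_odds_ratios(
    records: Iterable[Tuple[str, str]], topic: str
) -> Dict[str, float]:
    """Odds ratio of each breed appearing in `topic` versus all other breeds.

    `records` holds (breed, topic_label) pairs, one per clinical note.
    A 0.5 continuity correction is added to every cell of the 2x2 table.
    """
    records = list(records)
    breed_total = Counter(breed for breed, _ in records)
    breed_in_topic = Counter(breed for breed, t in records if t == topic)
    n_total = len(records)
    n_topic = sum(breed_in_topic.values())

    ratios = {}
    for breed, n_breed in breed_total.items():
        a = breed_in_topic[breed]        # this breed, in topic
        b = n_breed - a                  # this breed, other topics
        c = n_topic - a                  # other breeds, in topic
        d = (n_total - n_breed) - c      # other breeds, other topics
        ratios[breed] = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
    return ratios


# Made-up toy data: labels and counts are illustrative only.
toy_records = (
    [("breed_a", "topic_7")] * 8
    + [("breed_a", "other")] * 2
    + [("breed_b", "topic_7")] * 2
    + [("breed_b", "other")] * 18
)
toy_ratios = breed_topic_odds_ratios(toy_records, "topic_7")
```

In this toy data `breed_a` is over-represented in `topic_7` (odds ratio above 1) and `breed_b` is under-represented. A real screen would also need confidence intervals and correction for multiple comparisons across many breeds and topics.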

## Abstract

Historically, veterinary studies screening for breed, age and sex predisposition to disease have relied on collating small-scale studies of clinical datasets. The availability of larger datasets through groups such as the Small Animal Veterinary Surveillance Network (SAVSNET) promises access to information regarding a wide range of clinical presentations at scale. However, methodological limitations surrounding the extraction of specific disease information or screening for disease predispositions result in a substantial reduction in the number of animals studied. These studies often address very focused hypotheses, leveraging only a small fraction of the intrinsic value of the data at any one time. Here, we implemented an unsupervised machine learning methodology, creating a representation of a large volume of clinical notes collected by SAVSNET from veterinary practices across the UK. We utilise BERTopic, a topic-modelling tool based on the Bidirectional Encoder Representations from Transformers (BERT) architecture, and show it is able to surface known phenotypes, such as breed predispositions to hypoadrenocorticism, diabetes mellitus and mitral valve disease, as well as potential novel patterns of disease phenotypes. This scalable and granular modelling technique facilitates the rapid interrogation of large clinical datasets, enabling the identification of a broad range of phenotypes within the population and the early detection of temporal changes indicative of emerging infectious or environmental diseases.

The online version contains supplementary material available at 10.1186/s40537-026-01365-0.
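
The abstract mentions early detection of temporal changes. One simple way this could be approached, sketched here as an assumption rather than the paper's procedure, is to track the weekly share of consultations assigned to a topic and flag weeks that sit far above a trailing baseline. The function name `flag_unusual_weeks`, the eight-week window, the z-score threshold of 3 and the toy series are all hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def flag_unusual_weeks(
    weekly_share: Sequence[float],
    baseline_weeks: int = 8,
    threshold: float = 3.0,
) -> List[int]:
    """Return indices of weeks whose topic share exceeds the trailing baseline.

    Each week is compared with the mean and sample standard deviation of
    the preceding `baseline_weeks` weeks. A week is flagged when its
    z-score is greater than `threshold`.
    """
    flagged = []
    for i in range(baseline_weeks, len(weekly_share)):
        baseline = weekly_share[i - baseline_weeks:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (weekly_share[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up weekly shares of consultations assigned to one topic.
toy_shares = [0.020, 0.021, 0.019, 0.020, 0.022, 0.018,
              0.020, 0.021, 0.020, 0.045, 0.021]
toy_flags = flag_unusual_weeks(toy_shares)
```

With this toy series only the spike at index 9 is flagged. A surveillance system would likely need to account for seasonality and reporting delays, which this sketch ignores.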

## Linked entities

- **Diseases:** diabetes mellitus (MONDO:0005015), mitral valve disease (MONDO:0003767)
- **Species:** Canis lupus familiaris (taxon 9615)

## Full-text entities

- **Diseases:** diseases (MESH:D004194), hypoadrenocorticism (MESH:D000075262), infectious (MESH:D003141), mitral valve disease (MESH:D008946), diabetes mellitus (MESH:D003920)
- **Species:** Canis lupus familiaris (dog, subspecies) [taxon 9615]

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13035608/full.md

---
Source: https://tomesphere.com/paper/PMC13035608