# Ontology- and LLM-based data harmonization for federated learning in healthcare

**Authors:** Natallia Kokash, Lei Wang, Thomas H. Gillespie, Adam S. Z. Belloum, Paola Grosso, Sara Quinney, Lang Li, Bernard de Bono

PMC · DOI: 10.3389/fdgth.2026.1756555 · Frontiers in Digital Health · 2026-03-18

## TL;DR

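This paper presents a two-step pipeline that harmonizes heterogeneous clinical text into standard ontology terms for federated learning in healthcare: candidate concepts are first retrieved from the target ontology, then a large language model accepts or rejects each one.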

## Contribution

A two-step pipeline combining ontology-based retrieval and LLM validation for scalable data harmonization in federated learning.

## Key findings

- Expert-LLM agreement on ontology mappings reached up to 92% across two clinical datasets.
- LLM-based validation improved precision, while complementary retrieval strategies improved recall.
- The pipeline transforms manual ontology mapping into a reusable and configurable workflow.

## Abstract

Semantic heterogeneity across electronic health records (EHRs) limits scalable and privacy-preserving analytics in healthcare. While federated learning (FL) enables collaborative modeling without sharing raw data, it requires consistent, ontology-aligned representations. We present an ontology- and large language model (LLM)-based data harmonization approach to support secure, interoperable FL workflows.

We propose a general two-step pipeline for converting or annotating clinical text into a predefined target ontology format. First, candidate concepts are retrieved from the target vocabulary using embedding-based similarity search or ontology cross-references. Second, an LLM acts as a semantic validator, accepting or rejecting candidates based on explicit equivalence or subsumption criteria. The approach is ontology-agnostic and configurable; mapping to MONDO and HPO is demonstrated as a real-world use case. Final accepted mappings were evaluated against independent human expert assessment.
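The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the vocabulary entries and `EX:` identifiers are placeholders, character-trigram cosine similarity stands in for a real embedding model, and `stub_validator` is a hypothetical stand-in for the LLM call that would apply the equivalence or subsumption criteria.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

# Toy target vocabulary. IDs and labels are placeholders, not real ontology entries.
VOCAB: Dict[str, str] = {
    "EX:0001": "type 2 diabetes mellitus",
    "EX:0002": "diabetes mellitus",
    "EX:0003": "hypertension",
}

def trigram_vector(text: str) -> Counter:
    """Bag of character trigrams; an assumed stand-in for a learned embedding."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(term: str, vocab: Dict[str, str], top_k: int = 2) -> List[str]:
    """Step 1: return the IDs of the top_k most similar vocabulary labels."""
    query = trigram_vector(term)
    scored = [(cid, cosine(query, trigram_vector(label))) for cid, label in vocab.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scored[:top_k]]

# A validator takes (source term, candidate label) and returns accept/reject.
Validator = Callable[[str, str], bool]

def stub_validator(term: str, label: str) -> bool:
    """Hypothetical stand-in for the LLM: accept a candidate only if every
    word of its label occurs in the source term (a crude proxy for
    'equivalent to or broader than the term')."""
    return set(label.lower().split()) <= set(term.lower().split())

def harmonize(term: str, vocab: Dict[str, str], validate: Validator, top_k: int = 2) -> List[str]:
    """Step 2: keep only the retrieved candidates that the validator accepts."""
    return [cid for cid in retrieve(term, vocab, top_k) if validate(term, vocab[cid])]

if __name__ == "__main__":
    print(harmonize("diabetes mellitus type 2", VOCAB, stub_validator))
    print(harmonize("high blood pressure", VOCAB, stub_validator))
```

Retrieval always returns `top_k` candidates, even for a term with no good match, so the second call illustrates why a validation step is needed: the stub rejects every candidate for "high blood pressure". A real validator would replace `stub_validator` with a prompt to an LLM; passing it in as a callable keeps the retrieval strategy and the validator independently configurable.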

Across two clinical datasets, expert-LLM agreement reached up to 92%, with overall performance ranging from 78% to 91% depending on the candidate-generation strategy. Retrieval alone was insufficient for reliable mapping: LLM-based validation substantially improved precision, and complementary retrieval strategies improved recall.
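As a rough illustration of how such figures can be computed, the sketch below compares LLM accept/reject decisions with expert decisions over the same candidate mappings. The metric definitions here are an assumption (this summary does not state the paper's exact formulas), and the decision lists are invented for the example, not data from the paper.

```python
from typing import List, Tuple

def agreement(llm: List[bool], expert: List[bool]) -> float:
    """Fraction of candidate mappings on which the LLM and the expert decide the same."""
    if len(llm) != len(expert) or not llm:
        raise ValueError("decision lists must be non-empty and of equal length")
    return sum(l == e for l, e in zip(llm, expert)) / len(llm)

def precision_recall(llm: List[bool], expert: List[bool]) -> Tuple[float, float]:
    """Precision and recall of LLM-accepted mappings, treating expert decisions as reference."""
    tp = sum(l and e for l, e in zip(llm, expert))          # both accept
    fp = sum(l and not e for l, e in zip(llm, expert))      # LLM accepts, expert rejects
    fn = sum((not l) and e for l, e in zip(llm, expert))    # LLM rejects, expert accepts
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    # Invented decisions for five candidate mappings (True = accept).
    llm_decisions = [True, True, False, True, False]
    expert_decisions = [True, False, False, True, True]
    print(agreement(llm_decisions, expert_decisions))
    print(precision_recall(llm_decisions, expert_decisions))
```

Under these definitions, a stricter validator tends to raise precision by removing false accepts, while adding a second retrieval strategy can raise recall by surfacing correct candidates the first one missed.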

The proposed pipeline transforms ontology-based harmonization from a manual expert task into a reusable and configurable workflow suitable for federated healthcare research. By combining high-recall retrieval with LLM-based semantic adjudication, the approach enables scalable, privacy-preserving conversion of heterogeneous clinical text into standardized representations across domains.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13040560/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13040560/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/PMC13040560/full.md

---
Source: https://tomesphere.com/paper/PMC13040560