# A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts

**Authors:** Tim Adams, Yasamin Salimi, Mehmet Can Ay, Diego Valderrama, Marc Jacobs, Holger Fröhlich

PMC · DOI: 10.1016/j.tjpad.2025.100420 · 2025-12-01

## TL;DR

This paper introduces a new benchmark for evaluating text embedding models on the harmonization of Alzheimer's disease cohort metadata, and shows that it favors different models than general-purpose benchmarks do.

## Contribution

A novel benchmark and open-source tools for evaluating text embeddings in Alzheimer's data harmonization.

## Key findings

- Model rankings on the benchmark differed from those on general-purpose benchmarks, suggesting that models tuned for generic tasks may not translate well to clinical data harmonization.
- Formatting guidelines for metadata were proposed to improve harmonization processes.
- An open-source library and leaderboard were introduced to support future benchmarking.

## Abstract

Harmonizing diverse healthcare datasets is challenging because of inconsistent naming conventions. Manual harmonization is time- and resource-intensive, which limits scalability for multi-cohort Alzheimer's disease research. Large Language Models, and text-embedding models in particular, offer a promising solution, but their rapid development necessitates continuous, domain-specific benchmarking, especially since established general-purpose benchmarks lack clinical data harmonization use cases.

We aimed to evaluate how well different text-embedding models perform at harmonizing clinical variables.

We created a novel benchmark to assess how well different Language Model embeddings can be used to harmonize cohort study metadata with an in-house Common Data Model that includes cohort-to-cohort mappings for a wide range of Alzheimer’s Disease cohorts. We evaluated five different state-of-the-art text embedding models for seven different data sets in the context of Alzheimer’s disease.
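The matching step described above can be illustrated with a minimal sketch: embed each cohort variable description, rank the Common Data Model entries by cosine similarity, and score the ranking against known mappings. Everything in it is an assumption made for illustration. `toy_embed` is a bag-of-words stand-in for a real text embedding model, the variable names and descriptions are invented, and top-k accuracy is one plausible metric, not necessarily the one the benchmark reports.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a text embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_targets(description, targets, embed=toy_embed):
    """Rank target (CDM) variables by similarity to one cohort variable description."""
    query = embed(description)
    scored = [(name, cosine(query, embed(text))) for name, text in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_k_accuracy(cohort_vars, targets, gold, k=1, embed=toy_embed):
    """Fraction of cohort variables whose gold mapping is among the top-k ranked targets."""
    hits = 0
    for var, description in cohort_vars.items():
        top = [name for name, _ in rank_targets(description, targets, embed)[:k]]
        hits += gold[var] in top
    return hits / len(cohort_vars)


# Invented example metadata (descriptions only, no patient data).
cdm = {
    "age": "age of the participant at baseline visit",
    "mmse_total": "mini mental state examination total score",
    "education_years": "years of formal education",
}
cohort = {
    "AGE_BL": "participant age at baseline",
    "MMSCORE": "total score of the mini mental state examination",
    "EDUC": "number of years of education",
}
gold = {"AGE_BL": "age", "MMSCORE": "mmse_total", "EDUC": "education_years"}

print(rank_targets(cohort["MMSCORE"], cdm)[0][0])
print(top_k_accuracy(cohort, cdm, gold, k=1))
```

Swapping `toy_embed` for a function that wraps a real embedding model leaves the ranking and scoring logic unchanged, which is what allows different models to be compared on the same mappings.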

No patient data were utilized for any of the analyses, as the evaluation was based on semantic harmonization of cohort metadata only.

Text descriptions of variables from four modalities were included in the analyses: clinical, lifestyle, demographic, and imaging.

Our benchmark results favored different models than general-purpose benchmarks do. This suggests that models fine-tuned for generic tasks may not translate well to real-world data harmonization, particularly in Alzheimer’s disease. We propose guidelines for formatting metadata to facilitate manual or model-assisted data harmonization. We introduce an open-source library (https://github.com/SCAI-BIO/ADHTEB) and an interactive leaderboard (https://adhteb.scai.fraunhofer.de) to aid future model benchmarking.

Our findings highlight the importance of domain-specific benchmarks for clinical data harmonization in the field of Alzheimer’s disease and motivate standards for naming conventions that may support semi-automated mapping applications in the future.

## Linked entities

- **Diseases:** Alzheimer's disease (MONDO:0004975)

## Full-text entities

- **Diseases:** Alzheimer's Disease (MESH:D000544)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12811766/full.md

---
Source: https://tomesphere.com/paper/PMC12811766