A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts

Tim Adams; Yasamin Salimi; Mehmet Can Ay; Diego Valderrama; Marc Jacobs; Holger Fröhlich

PMC · DOI:10.1016/j.tjpad.2025.100420·December 1, 2025

A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts

Tim Adams, Yasamin Salimi, Mehmet Can Ay, Diego Valderrama, Marc Jacobs, Holger Fröhlich

PDF

Open Access

TL;DR

This paper introduces a new benchmark to evaluate text embedding models for harmonizing Alzheimer's disease cohort metadata, showing that domain-specific models perform better than general-purpose ones.

Contribution

A novel benchmark and open-source tools for evaluating text embeddings in Alzheimer's data harmonization.

Findings

01

Domain-specific models outperformed general-purpose models in clinical data harmonization.

02

Formatting guidelines for metadata were proposed to improve harmonization processes.

03

An open-source library and leaderboard were introduced to support future benchmarking.

Abstract

Harmonizing diverse healthcare datasets is a challenging task due to inconsistent naming conventions. Manual harmonization is time- and resource-intensive, limiting scalability for multi-cohort Alzheimer's Disease research. Large Language Models, or specifically text-embedding models, offer a promising solution, but their rapid development necessitates continuous, domain-specific benchmarking, especially since general established benchmarks lack clinical data harmonization use cases. To evaluate how different text-embedding models perform for the harmonization of clinical variables. We created a novel benchmark to assess how well different Language Model embeddings can be used to harmonize cohort study metadata with an in-house Common Data Model that includes cohort-to-cohort mappings for a wide range of Alzheimer’s Disease cohorts. We evaluated five different state-of-the-art text…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases1

Alzheimer's disease

Figures3

Click any figure to enlarge with its caption.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMachine Learning in Healthcare · Biomedical Text Mining and Ontologies · Genomics and Rare Diseases