A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering
Madan Krishnamurthy, Daniel Korn, Melissa A Haendel, Christopher J Mungall, Anne E Thessen

TL;DR
This paper presents a scalable framework that uses LLM-based embeddings and hierarchical density-based clustering (HDBSCAN) to semantically group and harmonize Common Data Elements across diverse biomedical datasets, improving data interoperability.
Contribution
The study introduces a novel dynamic framework combining LLM embeddings, HDBSCAN clustering, and supervised classification for CDE harmonization, validated on large biomedical datasets.
Findings
Identified 118 meaningful CDE clusters from over 24,000 elements.
Achieved 90.46% accuracy in classifying new CDEs into clusters.
Demonstrated strong external validation metrics (Adjusted Rand Index, ARI, of 0.52; Normalized Mutual Information, NMI, of 0.78).
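For readers unfamiliar with the two external validation metrics, the following is a minimal sketch of how ARI and NMI can be computed from a reference labeling and a predicted clustering. It is not the authors' evaluation code. The function names are hypothetical, and the NMI normalization (arithmetic mean of the two entropies) is an assumption, since the summary does not say which variant the paper used.

```python
import math
from collections import Counter


def _pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def adjusted_rand_index(truth, pred):
    """ARI from the contingency counts of two labelings (needs >= 2 items)."""
    n = len(truth)
    cells = Counter(zip(truth, pred))      # contingency table n_ij
    rows, cols = Counter(truth), Counter(pred)
    index = sum(_pairs(c) for c in cells.values())
    sum_rows = sum(_pairs(c) for c in rows.values())
    sum_cols = sum(_pairs(c) for c in cols.values())
    expected = sum_rows * sum_cols / _pairs(n)   # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:              # degenerate case, e.g. one cluster each
        return 1.0
    return (index - expected) / (max_index - expected)


def normalized_mutual_info(truth, pred):
    """NMI = MI / mean(H(truth), H(pred)); the normalization is an assumption."""
    n = len(truth)
    cells = Counter(zip(truth, pred))
    rows, cols = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * math.log(c * n / (rows[t] * cols[p]))
        for (t, p), c in cells.items()
    )
    h_rows = -sum((c / n) * math.log(c / n) for c in rows.values())
    h_cols = -sum((c / n) * math.log(c / n) for c in cols.values())
    denom = (h_rows + h_cols) / 2
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    reference = [0, 0, 1, 2]   # toy labels, not data from the paper
    clustered = [0, 0, 1, 1]
    print(adjusted_rand_index(reference, clustered))
    print(normalized_mutual_info(reference, clustered))
```

Both scores equal 1 when the clustering matches the reference up to a renaming of labels. ARI is corrected for chance and can go below 0, while NMI stays in the range 0 to 1, which is one reason the two numbers reported above can differ noticeably.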
Abstract
This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets by addressing challenges such as semantic heterogeneity, structural variability, and context dependence to streamline integration, enhance interoperability, and accelerate scientific discovery. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns. These embeddings are clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to group semantically similar CDEs. The framework incorporates four key steps: (1) LLM-based text embedding to mathematically represent semantic context, (2) unsupervised clustering of embeddings via HDBSCAN, (3) automated labeling using LLM…
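Steps (1) and (2) of the abstract, embedding each CDE and then grouping the embeddings by density, can be illustrated with a toy sketch. Everything here is a stand-in chosen so the example runs with the standard library alone: a bag-of-words count vector replaces the LLM embedding, and plain DBSCAN replaces HDBSCAN (HDBSCAN additionally builds a hierarchy over density levels and does not need a fixed `eps`). The CDE strings, function names, and the `eps` and `min_samples` values are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, deque


def embed(text):
    """Stand-in embedding: token counts. The paper uses LLM embeddings instead."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors (non-empty texts assumed)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm


def dbscan(dist, eps, min_samples):
    """Plain DBSCAN on a distance matrix; label -1 marks noise.

    Used here as a simpler density-based stand-in for HDBSCAN.
    """
    n = len(dist)
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n                    # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1                 # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster        # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


def group_cdes(cdes, eps=0.55, min_samples=2):
    """Embed each CDE description and group the embeddings by density."""
    vectors = [embed(c) for c in cdes]
    dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
    return dbscan(dist, eps, min_samples)


if __name__ == "__main__":
    toy_cdes = [                           # hypothetical CDE descriptions
        "patient age in years",
        "age of patient at enrollment in years",
        "patient age at diagnosis",
        "systolic blood pressure",
        "diastolic blood pressure",
        "blood pressure systolic reading",
        "favorite color",
    ]
    for label, cde in zip(group_cdes(toy_cdes), toy_cdes):
        print(label, cde)
```

On this toy input the age-related and blood-pressure-related descriptions fall into two separate groups and the unrelated item is left as noise. Word-overlap vectors cannot link synonyms such as "sex" and "gender", which is the gap that context-aware LLM embeddings are meant to close in the framework described above.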
Taxonomy
Topics: Semantic Web and Ontologies · Data Quality and Management · Advanced Database Systems and Queries
Methods: Gravity
