# Unveiling patterns in clinical data: exploring the role of large language models and clustering algorithms

**Authors:** Abbas S. Ali, Subi Gandhi, Syed H. Jafri, Mohammed M. Ali, Syed Y. Raza, Sulaiman Samian, James Mehaffey

PMC · DOI: 10.3389/frai.2026.1737530 · Frontiers in Artificial Intelligence · 2026-03-09

## TL;DR

This study explores how large language models can help analyze clinical data by preserving data structure and improving predictions, especially in resource-limited settings.

## Contribution

The study introduces a novel approach using LLMs for structured clinical data analysis and identifies optimal conditions for their use.

## Key findings

- LLM embeddings closely mirrored original data structures with high cosine similarity scores.
- Predictive performance improved as the subject variable ratio (SVR) increased, and three distinct performance groups emerged.
- LLMs can assist in individualized clinical decisions like optimizing surgical timing for infective endocarditis patients.

## Abstract

Large Language Models (LLMs) have shown exceptional performance in natural language processing, yet their utility in structured clinical data analysis remains relatively underexplored. This pilot study investigates whether LLM-generated embeddings can preserve the structural integrity of clinical datasets and enhance predictive modeling, particularly in resource-constrained settings.

We applied dimensionality reduction techniques, namely Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), together with k-means clustering to compare original data structures with those derived from LLM embeddings. Evaluation metrics included cosine similarity, area under the curve (AUC), and R², applied across 100 synthetic datasets and two real-world clinical datasets: the UCI medical database and endocarditis patient records. We assessed multiple LLM architectures, including BERT, RoBERTa, Llama 2, and E5-small, focusing on predictive accuracy and computational efficiency.
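The abstract does not spell out how cosine similarity was used to compare the original data with the embeddings. One plausible reading, sketched below purely as an assumption, is to compute the pairwise cosine similarities between subjects in each representation and then compare the two resulting similarity profiles. The function names (`cosine`, `pairwise_profile`, `structure_similarity`) and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def pairwise_profile(rows):
    """Cosine similarity for every unordered pair of rows (subjects)."""
    return [cosine(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)]


def structure_similarity(original, embedded):
    """Compare two representations of the same subjects.

    Assumption: 'structure' is summarized by the pairwise similarity
    profile, so the two inputs may have different dimensionality.
    """
    if len(original) != len(embedded):
        raise ValueError("both representations must cover the same subjects")
    return cosine(pairwise_profile(original), pairwise_profile(embedded))


# Toy data: the 'embedding' is a rescaled, zero-padded copy of the original,
# so the pairwise structure is preserved and the score is close to 1.0.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedded = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
score = structure_similarity(original, embedded)
```

Comparing similarity profiles rather than raw vectors sidesteps the dimensionality mismatch between a tabular row and an LLM embedding; the study itself may have used a different procedure.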

LLM embeddings closely mirrored original data structures, with BERT achieving a cosine similarity of 0.95 on linear datasets and Llama 2 (30B) reaching 0.85 on quadratic datasets, albeit with higher computational costs. Predictive performance improved significantly across the board as the subject variable ratio (SVR) increased. Three groups were identified: similar performance, assisted better, and assisted significantly better. These groups differed according to the equation used to generate the synthetic data.
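To make the SVR and the linear versus quadratic distinction concrete, here is a minimal sketch. It assumes SVR means the number of subjects divided by the number of variables, and it uses illustrative generating equations (a weighted sum of the variables, or of their squares, plus Gaussian noise). The paper's actual equations are not given in this summary, so `make_synthetic` and its parameters are hypothetical.

```python
import random


def subject_variable_ratio(n_subjects, n_variables):
    """SVR, assumed here to be subjects divided by variables."""
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    return n_subjects / n_variables


def make_synthetic(n_subjects, n_variables, kind="linear", noise=0.1, seed=0):
    """Generate a toy dataset (X, y) from an illustrative equation.

    kind="linear":    y = sum(c_k * x_k)   + noise
    kind="quadratic": y = sum(c_k * x_k^2) + noise
    These forms are assumptions for illustration only.
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    coefs = [rng.uniform(-1.0, 1.0) for _ in range(n_variables)]
    X, y = [], []
    for _ in range(n_subjects):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_variables)]
        if kind == "linear":
            signal = sum(c * x for c, x in zip(coefs, row))
        elif kind == "quadratic":
            signal = sum(c * x * x for c, x in zip(coefs, row))
        else:
            raise ValueError("kind must be 'linear' or 'quadratic'")
        X.append(row)
        y.append(signal + rng.gauss(0.0, noise))
    return X, y


# Example: 200 subjects and 10 variables give an SVR of 20.
svr = subject_variable_ratio(200, 10)
X_lin, y_lin = make_synthetic(200, 10, kind="linear", seed=42)
X_quad, y_quad = make_synthetic(200, 10, kind="quadratic", seed=42)
```

Sweeping `n_subjects` at a fixed `n_variables` is one simple way to study how a model behaves across SVR values for each generating equation.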

These findings highlight the potential of LLMs to enhance structured data analysis by identifying optimal conditions, such as SVR thresholds, for their practical use. The trade-off between computational cost and performance across different LLM architectures is also emphasized, suggesting the need for context-specific model selection.

LLMs can be effectively leveraged to repurpose existing clinical datasets for individualized clinical questions, such as optimizing surgical timing for patients with infective endocarditis and embolic stroke. This approach advances precision medicine and supports data-driven clinical decision-making.

## Linked entities

- **Diseases:** infective endocarditis (MONDO:0000565)

## Full-text entities

- **Diseases:** endocarditis (MESH:D004696), embolic stroke (MESH:D000083262)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13006407/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13006407/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/PMC13006407/full.md

---
Source: https://tomesphere.com/paper/PMC13006407