# Optimizing clustering of CDR3 sequences using natural language processing, Word2Vec, and KMeans

**Authors:** Sanskriti Baranwal, Ricardo Avila Sanchez, Clement-Andi Edet, Erick Chastain, Inimary Toby

PMC · DOI: 10.3389/fbinf.2025.1623488 · Frontiers in Bioinformatics · 2025-10-02

## TL;DR

This paper introduces a natural language processing pipeline (Word2Vec embeddings, PCA, and KMeans) for clustering T-cell receptor CDR3 sequences, and reports that repertoire structure differs among ARDS, non-ARDS, and control samples.

## Contribution

The paper introduces an NLP-based pipeline that combines Word2Vec embeddings, PCA, and KMeans to cluster CDR3 sequences from TCR β-chain repertoires.

## Key findings

- Control samples showed tight, low-diversity clusters in CDR3 repertoire structure.
- ARDS patients exhibited high dispersion and diffuse clusters, indicating immune disruption.
- The framework successfully captured immune activation patterns in CDR3 space.
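
One way to make "tight" versus "diffuse" concrete is to measure how far cluster members sit from their centroid. The sketch below is illustrative only: the summary does not say which dispersion measure the authors used, so the function `cluster_dispersion`, the choice of mean Euclidean distance, and the toy points are all assumptions.

```python
import math
from collections import defaultdict


def cluster_dispersion(points, labels):
    """Return {label: mean Euclidean distance of members to their cluster centroid}.

    Hypothetical helper for illustration; not the metric reported in the paper.
    """
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    dispersion = {}
    for label, members in groups.items():
        # Centroid = coordinate-wise mean of the cluster's members.
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        dispersion[label] = sum(math.dist(p, centroid) for p in members) / len(members)
    return dispersion


# Made-up 2-D points (e.g. PCA coordinates); not values from the paper.
tight = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
diffuse = [(0.0, 0.0), (0.0, 20.0), (20.0, 0.0), (20.0, 20.0)]

d = cluster_dispersion(tight + diffuse, ["tight"] * 4 + ["diffuse"] * 4)
print(d)
```

Under this assumed measure, a larger value means a more spread-out cluster, so the "diffuse" group scores higher than the "tight" one.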

## Abstract

T-cell receptor (TCR) sequencing has emerged as a powerful tool for understanding adaptive immune responses, yet challenges persist in deciphering the immense diversity of Complementarity-Determining Region 3 (CDR3) sequences. This study presents a novel natural language processing (NLP)-based pipeline to cluster CDR3 sequences from TCR β-chain repertoires using Word2Vec embeddings, principal component analysis (PCA), and KMeans clustering. Focusing on Acute Respiratory Distress Syndrome (ARDS), a life-threatening inflammatory lung condition, we trained Word2Vec models on healthy controls and applied unsupervised clustering across ARDS, non-ARDS, and control datasets. Dimensionality-reduced embeddings revealed clear distinctions in repertoire structure: control samples exhibited tight, low-diversity clusters; ARDS patients showed high dispersion and numerous diffuse clusters indicative of repertoire disruption; and non-ARDS samples displayed intermediate organization. These differences suggest that immune activation states are embedded in the structural topology of the CDR3 space. Our framework successfully captured these latent patterns, offering a scalable approach to biomarker discovery. This study not only reinforces the utility of NLP in immunological analysis but also paves the way for data-driven immune monitoring in critical care and personalized diagnostics.

## Linked entities

- **Diseases:** Acute Respiratory Distress Syndrome (ARDS) (MONDO:0006502)

## Full-text entities

- **Genes:** TRBV20OR9-2 (T cell receptor beta variable 20/OR9-2 (non-functional)) [NCBI Gene 6962] {aka CDR3, TCRBV20S2, TCRBV2O, TCRBV2S2O}
- **Diseases:** ARDS (MESH:D012128), inflammatory lung condition (MESH:D016726)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12528129/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12528129/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12528129/full.md

---
Source: https://tomesphere.com/paper/PMC12528129