# Optimization of clustering parameters for single-cell RNA analysis using intrinsic goodness metrics

**Authors:** Nicolina Sciaraffa, Antonino Gagliano, Luigi Augugliaro, Claudia Coronnello

PMC · DOI: 10.3389/fbinf.2025.1562410 · Frontiers in Bioinformatics · 2025-06-11

## TL;DR

This paper shows that intrinsic goodness metrics, which require no ground-truth labels, can predict clustering accuracy in single-cell RNA analysis and so guide the choice of clustering parameters.

## Contribution

The study introduces a method to predict clustering accuracy from intrinsic goodness metrics, evaluated on three annotated datasets and two clustering algorithms (Leiden and DESC).

## Key findings

- Building the neighborhood graph with the UMAP method and increasing the resolution both improve clustering accuracy; the resolution effect is stronger with fewer nearest neighbors.
- The within-cluster dispersion and the Banfield-Raftery index are effective proxies for clustering accuracy.
- Several values for the number of principal components should be tested, because this parameter is highly sensitive to data complexity.
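
The findings above suggest sweeping these parameters jointly rather than tuning one at a time. A minimal sketch of enumerating such a grid follows. The parameter names and candidate values are illustrative assumptions, not the grid used in the paper, and `gauss` stands in for any non-UMAP graph method.

```python
from itertools import product

# Hypothetical candidate values; the paper's actual grid is not given here.
GRID = {
    "graph_method": ["umap", "gauss"],  # how the neighborhood graph is built
    "n_neighbors": [5, 15, 30],         # fewer neighbors -> sparser graph
    "resolution": [0.5, 1.0, 2.0],      # higher -> more, smaller clusters
    "n_pcs": [10, 30, 50],              # sensitive to data complexity
}


def iter_configs(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(GRID))
print(len(configs), "configurations; first:", configs[0])
```

Each configuration would then be passed to the clustering pipeline and scored, either against ground truth where it exists or with an intrinsic metric where it does not.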

## Abstract

Accurate clustering of cell subpopulations is a crucial aspect of single-cell RNA sequencing analysis, and the ability to separate subpopulations correctly hinges on the efficacy of unsupervised clustering. Despite the many advances in and adaptations of clustering algorithms, correct clustering of cells remains challenging and depends both on the data and on the parameters selected for the clustering process. In this context, the present study aimed to predict the accuracy of clustering methods under varying parameters by exploiting intrinsic goodness metrics.

This study used three datasets, each originating from a distinct anatomical district and each with a ground-truth cell annotation. Two clustering methods were employed: the Leiden and the Deep Embedding for Single-cell Clustering (DESC) algorithms. First, a robust linear mixed regression model was fitted to analyze the impact of clustering parameters on accuracy. Subsequently, fifteen intrinsic measures were calculated and used to train an ElasticNet regression model, in both intra- and cross-dataset settings, to evaluate whether clustering accuracy can be predicted.
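
The cross-dataset prediction step described above can be illustrated with a small from-scratch elastic-net regression fitted by cyclic coordinate descent. This is a sketch under stated assumptions, not the authors' pipeline: the data are synthetic, the two features are hypothetical stand-ins for intrinsic metrics, and `alpha` and `l1_ratio` are arbitrary. In practice a library implementation such as scikit-learn's `ElasticNet` would normally be used instead.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.01, l1_ratio=0.5, n_iter=200):
    """Minimize (1/2n)*||y - b - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    residual = [v - y_mean for v in y]  # equals yc - Xc @ w while w is all zeros
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # constant feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            for i in range(n):
                residual[i] -= Xc[i][j] * delta
            w[j] = new_w
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, intercept


def predict(X, w, intercept):
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


# Synthetic "dataset A": columns are [dispersion-like metric, unrelated metric].
X_train = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
y_train = [0.9 - 0.1 * row[0] for row in X_train]  # made-up accuracy values

# Synthetic "dataset B": configurations never seen during training.
X_other = [[0.5, 0], [2.5, 1], [4.5, 0]]
y_other = [0.9 - 0.1 * row[0] for row in X_other]

w, intercept = fit_elastic_net(X_train, y_train)
preds = predict(X_other, w, intercept)
print("coefficients:", w, "intercept:", intercept)
print("predicted vs. true:", list(zip(preds, y_other)))
```

In this toy setup the L1 part of the penalty sets the coefficient of the unrelated feature to zero, while the informative feature keeps a negative weight. That sparsity is one reason an elastic net suits the task of picking a few useful proxies out of many candidate metrics.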

The first-order interactions showed that using the UMAP method to generate the neighborhood graph and increasing the resolution both have a beneficial impact on accuracy. The effect of the resolution parameter is accentuated when the number of nearest neighbors is reduced, which yields sparser and more locally sensitive graphs that better preserve fine-grained cellular relationships. It is also advisable to test different numbers of principal components, because this parameter is highly affected by data complexity. Clustering accuracy could be predicted effectively from the intrinsic metrics. In particular, the within-cluster dispersion and the Banfield-Raftery index can serve as proxies for accuracy, allowing an immediate comparison of different clustering parameter configurations.
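
The two proxy metrics named above can be computed directly from an embedding and its cluster labels. The sketch below uses their common textbook formulation: the within-cluster dispersion is the pooled sum of squared distances to cluster centroids, and the Banfield-Raftery index sums `n_k * log(trace_k / n_k)` over clusters. The paper's exact implementation may differ, the toy points are made up, and skipping zero-trace clusters is an assumption of this sketch.

```python
import math
from collections import defaultdict


def cluster_traces(points, labels):
    """Return {label: (n_k, trace_k)}, where trace_k is the sum of squared
    distances from each point in cluster k to that cluster's centroid."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    result = {}
    for label, members in groups.items():
        n_k = len(members)
        centroid = [sum(coord) / n_k for coord in zip(*members)]
        trace = sum(
            (x - c) ** 2 for member in members for x, c in zip(member, centroid)
        )
        result[label] = (n_k, trace)
    return result


def within_cluster_dispersion(points, labels):
    """Pooled within-cluster sum of squares (lower means more compact)."""
    return sum(trace for _, trace in cluster_traces(points, labels).values())


def banfield_raftery(points, labels):
    """Sum over clusters of n_k * log(trace_k / n_k).
    Assumption: clusters with zero trace (e.g. singletons) are skipped,
    because the logarithm is undefined for them."""
    total = 0.0
    for n_k, trace in cluster_traces(points, labels).values():
        if trace > 0:
            total += n_k * math.log(trace / n_k)
    return total


# Toy 2-D embedding with two well-separated groups (made-up data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]  # separates the two groups
bad = [0, 1, 0, 1]   # mixes them

for name, labels in [("good", good), ("bad", bad)]:
    print(name, within_cluster_dispersion(points, labels), banfield_raftery(points, labels))
```

On this toy example the labeling that separates the two groups scores lower on both metrics than the labeling that mixes them, which is the kind of immediate comparison between configurations that the abstract describes.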

## Full-text entities

- **Genes:** SELE (selectin E) [NCBI Gene 6401] {aka CD62E, ELAM, ELAM1, ESEL, LECAM2, selectin-e}, ICAM1 (intercellular adhesion molecule 1) [NCBI Gene 3383] {aka BB2, CD54, P3.58}, CLDN5 (claudin 5) [NCBI Gene 7122] {aka AWAL, BEC1, CPETRL1, TMDVCF, TMVCF}, PECAM1 (platelet and endothelial cell adhesion molecule 1) [NCBI Gene 5175] {aka CD31, CD31/EndoCAM, GPIIA', PECA1, PECAM-1, endoCAM}
- **Diseases:** inflammatory (MESH:D007249), tumor (MESH:D009369), EC (MESH:D005955)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12187673/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12187673/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/PMC12187673/full.md

---
Source: https://tomesphere.com/paper/PMC12187673