# CANTAO: guiding clustering and annotation in single-cell RNA sequencing using average overlap

**Authors:** Christopher Thai, Amartya Singh, Daniel Herranz, Hossein Khiabanian

PMC · DOI: 10.1038/s44320-025-00176-4 · Molecular Systems Biology · 2025-12-08

## TL;DR

CANTAO is a method for single-cell RNA sequencing analysis that guides clustering and annotation by comparing clusters through the average overlap of their ranked lists of differentially expressed genes.

## Contribution

CANTAO uses the average overlap metric to define a distance between clusters from their ranked lists of differentially expressed genes, weighting top-ranked genes most heavily, so that de novo clusters can be compared and annotated more robustly.
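To make the metric concrete, here is a minimal sketch of average overlap in its standard form: for two ranked lists, the agreement at depth d is the fraction of items shared by the two top-d prefixes, and AO is the mean of those agreements over depths 1 to k. The function names `average_overlap` and `ao_distance_matrix`, the choice of `1 - AO` as the distance, and the evaluation depth are assumptions made for illustration; the summary does not state how CANTAO implements these details.

```python
from itertools import combinations


def average_overlap(ranked_a, ranked_b, depth=None):
    """Average overlap (AO) of two ranked lists, evaluated to `depth`.

    AO = (1/k) * sum over d = 1..k of |A[:d] & B[:d]| / d

    Early ranks take part in every prefix, so disagreement near the
    top of the lists lowers the score more than disagreement deeper down.
    """
    max_depth = min(len(ranked_a), len(ranked_b))
    if depth is None:
        depth = max_depth
    if depth < 1 or depth > max_depth:
        raise ValueError("depth must be between 1 and the shorter list length")

    total = 0.0
    for d in range(1, depth + 1):
        shared = set(ranked_a[:d]) & set(ranked_b[:d])
        total += len(shared) / d  # agreement at depth d
    return total / depth


def ao_distance_matrix(ranked_genes_by_cluster, depth=None):
    """Pairwise distances between clusters, assuming distance = 1 - AO.

    `ranked_genes_by_cluster` maps a cluster label to its list of
    differentially expressed genes, best-ranked first (hypothetical input).
    """
    labels = sorted(ranked_genes_by_cluster)
    dist = {(label, label): 0.0 for label in labels}
    for a, b in combinations(labels, 2):
        ao = average_overlap(
            ranked_genes_by_cluster[a], ranked_genes_by_cluster[b], depth
        )
        dist[(a, b)] = dist[(b, a)] = 1.0 - ao
    return dist


# Toy rankings with placeholder gene names (not data from the paper).
reference = ["g1", "g2", "g3", "g4"]
swap_at_bottom = ["g1", "g2", "g4", "g3"]
swap_at_top = ["g2", "g1", "g3", "g4"]

ao_bottom = average_overlap(reference, swap_at_bottom)
ao_top = average_overlap(reference, swap_at_top)
toy_distances = ao_distance_matrix(
    {"c0": reference, "c1": swap_at_bottom, "c2": swap_at_top}
)
```

In this toy case, swapping the last two genes gives an AO of about 0.917, while swapping the first two gives 0.75, which illustrates the top-weighted behaviour. In practice the ranked lists would come from a differential expression test run per cluster, and the depth would need to be chosen for the dataset.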

## Key findings

- On truth-known benchmark datasets, AO-guided clustering identifies the cell populations that correspond most closely to the true cell identities.
- AO-guided clustering reveals previously unresolved T-cell development stages in mouse thymocytes.
- CANTAO clarifies biological interpretation in homogeneous cell populations and detects minor subpopulations.

## Abstract

Single-cell RNA sequencing allows defining cellular identities based on transcriptional similarity using unsupervised clustering. However, a single clustering resolution may not yield groups of cells that represent both broad, well-defined populations and smaller subpopulations simultaneously. Therefore, when cell identities are not known prior to sequencing, robust comparison and annotation of inferred de novo clusters remains a challenge. Here, we introduce CANTAO, in which we propose the average overlap metric to define the distance between single-cell clusters by comparing ranked lists of differentially expressed genes in a top-weighted manner. We benchmark CANTAO in truth-known datasets comprised of similar yet distinct cell populations and show that evaluating clusters with average overlap results in a consistent, precise, and biologically meaningful recapitulation of true cell identities. We then analyze unsorted mouse thymocytes and characterize stages of T-cell development in the thymus, including minor populations of double-negative (CD4-CD8-) T cells that are difficult to confidently detect among unsorted single cells. We demonstrate that CANTAO enables robust, reproducible characterization of single-cell data and clarifies biological interpretation of underlying identities in homogeneous populations.

CANTAO is a method that uses the average overlap metric to quantify similarity of clusters inferred from unsupervised computational analyses of single-cell RNA sequencing data, based on the rankings of differentially expressed genes.

AO-guided clustering is benchmarked on multiple truth-known datasets, where it identifies the cell populations that correspond most closely to the cell identities actually present.

AO-guided clustering of mouse thymocytes identifies previously unresolved stages of T-cell development from RNA measurements of unsorted single cells, spanning bone marrow multipotent progenitors (MPPs), stages of double negative (DN) populations from DN1 to DN4, immature single positive (ISP) and double positive (DP) cells, and eventually mature CD4 and CD8 T-cells.

## Linked entities

- **Proteins:** CD4 (CD4 molecule), CD8A (CD8 subunit alpha)
- **Species:** Mus musculus (taxon 10090)

## Full-text entities

- **Genes:** Cd4 (CD4 antigen) [NCBI Gene 12504] {aka L3T4, Ly-4}
- **Species:** Mus musculus (house mouse, species) [taxon 10090]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12954110/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12954110/full.md

---
Source: https://tomesphere.com/paper/PMC12954110