scDBic: a novel deep learning-based biclustering algorithm for analyzing scRNA-seq data

Xiaoqi Tang; Caihua Liu; Chaowang Lan

PMC · DOI:10.1093/bioinformatics/btag095·February 26, 2026

TL;DR

scDBic is a new deep learning method for analyzing single-cell RNA data that better identifies cell groups and their key genes.

Contribution

scDBic introduces a novel deep learning-based biclustering algorithm with improved cell clustering and key gene identification.

Findings

1. scDBic improves cell clustering performance compared to traditional and biclustering algorithms.

2. The method identifies key genes for each cell group using a reverse strategy.

3. The algorithm is freely available and outperforms existing techniques in scRNA-seq analysis.

Abstract

Clustering single-cell RNA sequencing (scRNA-seq) data plays a vital role in the study of cellular heterogeneity. Many algorithms have been developed to cluster scRNA-seq data. However, traditional clustering algorithms often fail to capture local consistency, whereas biclustering algorithms suffer from issues such as cell loss, poor adaptability to high-dimensional data, and iterative selection challenges. In this paper, we introduce scDBic, a novel deep learning-based biclustering algorithm specialized for scRNA-seq data. It comprises three main steps: cell clustering with a deep autoencoder, gene clustering, and identification of key gene clusters using the reverse strategy. The key idea is that the deep autoencoder captures the main information of gene expression and the reverse strategy identifies the key genes of cell groups. Therefore, cell clustering performance can be…

Table 1. The performance of our method and Walktrap-based methods.

Metric	Approach	E-MTAB-3321	GSE45719	GSE52529	E-MTAB-2600
F1	scDBic	0.923	0.856	0.642	0.982
F1	Walktrap + Autoencoder	0.899	0.822	0.634	0.973
F1	Walktrap + Key Biclusters + PCA	0.954	0.801	0.634	0.947
F1	Walktrap	0.454	0.523	0.504	0.841
ARI	scDBic	0.981	0.638	0.288	0.958
ARI	Walktrap + Autoencoder	0.971	0.560	0.325	0.934
ARI	Walktrap + Key Biclusters + PCA	0.923	0.511	0.263	0.866
ARI	Walktrap	0.204	0.387	0.235	0.639
NMI	scDBic	0.950	0.776	0.313	0.925
NMI	Walktrap + Autoencoder	0.928	0.611	0.345	0.883
NMI	Walktrap + Key Biclusters + PCA	0.896	0.742	0.299	0.796
NMI	Walktrap	0.267	0.611	0.277	0.650

Funding

—Guangxi Science and Technology Base and Talent Special Fund
—Humanities and Social Sciences Youth Foundation, Ministry of Education of the People’s Republic of China

Taxonomy

Topics: Single-cell and spatial transcriptomics · Genomics and Phylogenetic Studies · Gene expression and cancer classification

Full text

1 Introduction

A large number of biological processes are accomplished through the collective efforts of cells within organisms (Celiker and Gore 2013). Understanding how cells partition tasks, coordinate activities, and manage biological processes remains at a rudimentary stage. Single-cell RNA sequencing (scRNA-seq) technology enables researchers to precisely dissect biological processes at the single-cell level (Rosenberg et al. 2018).

However, the analysis of scRNA-seq data faces several challenges. Owing to cell heterogeneity, high cell dropout rates, lowly expressed genes, technical biases, and batch effects, scRNA-seq data are high-dimensional, sparse, and noisy, which complicates downstream analysis.

Clustering is a key computational method for analyzing scRNA-seq data and is an effective method for studying cellular heterogeneity, which provides insights into disease mechanisms. For example, previous studies have confirmed that CD8+ T cells are critical for autoimmune lymphoproliferative syndromes (Eberhardt et al. 2021). However, intervention therapy targeting all CD8+ T cells indiscriminately could pose significant harm to the human body, potentially outweighing the risks of the disease. Therefore, identifying pathogenic CD8+ T cells is of utmost importance in the treatment of autoimmune lymphoproliferative syndrome. In addition, clustering can reveal disease-causing mechanisms (Li et al. 2021b), identify different cell types and functional states (Zheng et al. 2017), discover rare cell populations, reconstruct cell developmental trajectories (Trapnell et al. 2014), and build spatial models of complex tissues (Satija et al. 2015), etc.

Traditional clustering methods apply global gene information to determine different cell types, making it difficult to capture local consistency (Hussain and Ramazan 2016). Local consistency refers to local gene-cell co-regulatory patterns in which specific subsets of genes are co-regulated only within specific subsets of cells. Existing studies have demonstrated that only a small subset of genes plays a role in specific cell types or states within scRNA-seq data (Sun et al. 2023). Therefore, traditional clustering methods may not be suitable for clustering scRNA-seq data. Biclustering methods cluster both cells and genes simultaneously, so they can not only identify different cell types but also discover potential key genes that are active in different cell types.

Biclustering algorithms can be broadly categorized into greedy methods (Xue et al. 2018, Torres-Jimenez and Perez-Torres 2019), divide-and-conquer methods (Bandyopadhyay and Mallik 2018), exhaustive enumeration methods (Rodriguez-Baena et al. 2011, Serin and Vingron 2011), factorization-based methods (Moran et al. 2021), dynamic programming-based methods (Orzechowski et al. 2018), and graph-based methods (Shi and Huang 2017, Dong and Yuan 2020, Xie et al. 2020). Existing biclustering methods have two main issues. First, few biclustering methods are specifically designed for scRNA-seq data, and existing methods cannot effectively handle the challenges posed by the high dimensionality, sparsity, and noise of scRNA-seq data. Second, current biclustering methods often fail to effectively extract key gene clusters and lack proper explanations.

To address the aforementioned issues, we initially explored the integration of deep learning with biclustering in our previous work, scDCABC (Tang and Lan 2024). While scDCABC successfully introduced an iterative framework, it relied on a passive trial-and-error strategy that often incurred computational redundancy. Building upon this foundation, we present scDBic, an optimized and novel deep learning-based biclustering algorithm. scDBic effectively uncovers hidden patterns in gene expression matrices by iteratively combining deep cell clustering and gene clustering through three synergistic phases: the cell clustering phase, the gene clustering judgment phase, and the gene clustering phase.

In the cell clustering phase, deep neural layers are employed to approximate non-linear transformations, automatically mapping the gene expression matrix to a lower-dimensional latent space. Subsequently, a Shared Nearest Neighbor (SNN) graph is constructed to refine clustering, addressing the challenge of processing high-dimensional data. The gene clustering judgment phase applies a recursive mechanism akin to hierarchical clustering that adapts automatically to the data, effectively solving the issue of determining optimal iteration counts. In the gene clustering phase, key gene clusters are identified by comparing intra-cluster similarity; crucially, all cell information within these clusters is retained, preventing the cell loss typical of traditional biclustering. By capturing local gene-cell co-regulatory patterns instead of relying on global distances, our method exhibits robustness against the “curse of dimensionality” and noise (Xie et al. 2016).

2 Materials and methods

The scDBic algorithm for biclustering scRNA-seq data comprises three phases: the cell clustering phase, the gene clustering judgment phase, and the gene clustering phase. The flowchart of this algorithm is presented in Fig. 1. Cell clustering and gene clustering are the core phases that form the body of the biclustering procedure. The subsequent sections describe each phase in more detail.

Fig. 1. The flowchart of scDBic.

2.1 Cell clustering phase

The cell clustering phase contains four modules: the preprocessing module, the deep autoencoder module, the SNN graph construction module, and the Walktrap cell clustering module. The framework for cell clustering is illustrated in Fig. 2.

Fig. 2. The workflow of the cell clustering phase. Steps a, b, c, and d represent the preprocessing module, the deep autoencoder module, the SNN graph construction module, and the Walktrap cell clustering module, respectively.

For clarity, in each iteration, we consistently represent the preprocessed gene expression profile as $X$, with the number of cells and genes denoted as n and m, respectively.

2.1.1 Preprocessing module

The data preprocessing is mainly performed by log-normalization and standardization, corresponding to step a in Fig. 2. The log1p function is applied to decrease the skew of gene expression data (Khan et al. 2022). Then, Z-Score normalization is employed to normalize gene expression values.
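The two preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes a cells x genes matrix stored as a list of rows, assumes the Z-score is computed per gene across cells with the population standard deviation, and assumes (near-)constant genes are mapped to zero. The function name `preprocess` is hypothetical.

```python
import math

def preprocess(counts, eps=1e-12):
    """log1p-transform a cells x genes matrix, then z-score each gene (column)."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    logged = [[math.log1p(v) for v in row] for row in counts]
    out = [[0.0] * n_genes for _ in range(n_cells)]
    for g in range(n_genes):
        col = [logged[c][g] for c in range(n_cells)]
        mean = sum(col) / n_cells
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n_cells)
        for c in range(n_cells):
            # Assumption: genes with (near-)zero variance are set to 0
            # instead of dividing by a vanishing standard deviation.
            out[c][g] = (col[c] - mean) / std if std > eps else 0.0
    return out

if __name__ == "__main__":
    for row in preprocess([[0, 5], [3, 5], [8, 5]]):
        print([round(v, 3) for v in row])
```

After this step every varying gene has zero mean and unit variance across cells, so highly expressed genes do not dominate the later distance computations.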

2.1.2 Deep autoencoder module

The deep autoencoder, a specialized neural network for dimensionality reduction, is applied to address the challenges posed by the “curse of dimensionality” in scRNA-seq data (Hinton and Salakhutdinov 2006). It comprises input layer, encoding layer, bottleneck layer, decoding layer, and output layer (Vincent et al. 2008), corresponding to step b in Fig. 2.

Input layer: The input layer accepts the preprocessed gene expression profile X and has m neurons, where m is the number of genes.

Encoding layers: The encoding layers compress the high-dimensional X into a low-dimensional bottleneck. They consist of three fully connected layers with 1024, 512, and 256 neurons, respectively, each using the Scaled Exponential Linear Unit (SELU) as the activation function.

Bottleneck layer: The bottleneck layer is the actual output of the deep autoencoder module. It consists of 128 neurons, uses SELU as the activation function, and captures the essence of the gene expression profile. For clarity, in each iteration, the downscaled gene expression profile of the bottleneck layer is denoted as $[eqn]$ .

Decoding layers: Mirroring the encoding layers, the decoding layers reconstruct X from $[eqn]$ . This reconstruction uses three fully connected layers with 256, 512, and 1024 neurons, again with SELU, so that the activation function is consistent across the network.

Output layer: The output layer produces the reconstructed X with m neurons, using Sigmoid as the activation function. Note that this is not the final output of scDBic’s deep autoencoder module, but rather the reconstructed approximation of the input.
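The layer layout described above can be written down as a small specification. The sketch below only enumerates the fully connected layers and defines the SELU activation with its standard constants; it does not train anything, and the helper names `autoencoder_layers` and `selu` are illustrative assumptions rather than names from the scDBic code.

```python
import math

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """Scaled Exponential Linear Unit applied to a single value."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))

def autoencoder_layers(m):
    """Return (in_dim, out_dim, activation) for each fully connected layer.

    m is the number of genes. The widths follow the text: three encoding
    layers, a 128-unit bottleneck, three decoding layers, and an m-unit output.
    """
    widths = [m, 1024, 512, 256, 128, 256, 512, 1024, m]
    activations = ["selu"] * 7 + ["sigmoid"]
    return [(widths[i], widths[i + 1], activations[i]) for i in range(8)]

if __name__ == "__main__":
    for layer in autoencoder_layers(20000):
        print(layer)
```

The fourth entry, `(256, 128, "selu")`, is the bottleneck whose activations are passed on to the SNN graph construction.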

We employ the scaled cosine error (SCE) loss to enforce directional similarity between the input $[eqn]$ and reconstructed $[eqn]$ profiles:

[eqn]

where N is the batch size, and $[eqn]$ denotes the $[eqn]$ norm. This formulation prioritizes relative expression patterns over absolute magnitudes. The autoencoder involves two key hyperparameters: the choice of optimizer and number of training epochs. Following extensive hyperparameter tuning experiments (see Section 2 and Chapter 2, available as supplementary data at Bioinformatics online), we set RMSprop as the default optimizer and recommend an epoch range between 70 and 150, with 100 epochs used by default.
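As an illustration of a loss that compares directions rather than magnitudes, the sketch below implements a commonly used form of the scaled cosine error: one minus the cosine similarity of each input and reconstruction, raised to a scaling exponent and averaged over the batch. The exponent `gamma` and its default value are assumptions, since the exact formula is not reproduced in this text.

```python
import math

def sce_loss(inputs, recons, gamma=2.0):
    """Mean of (1 - cosine similarity) ** gamma over a batch of vector pairs."""
    total = 0.0
    for x, z in zip(inputs, recons):
        dot = sum(a * b for a, b in zip(x, z))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in z))
        cos = dot / norm if norm > 0 else 0.0
        # Clamp at 0 so rounding error cannot produce a negative base.
        total += max(0.0, 1.0 - cos) ** gamma
    return total / len(inputs)

if __name__ == "__main__":
    print(sce_loss([[1, 2, 3]], [[2, 4, 6]]))   # same direction, different scale
    print(sce_loss([[1, 0]], [[0, 1]]))         # orthogonal profiles
```

A reconstruction that is a rescaled copy of the input incurs essentially no loss, which is what "prioritizes relative expression patterns over absolute magnitudes" means in practice.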

2.1.3 Constructing the SNN graph module

The SNN graph is a structure constructed based on the counts of nearest neighbors shared among the data points (Xu and Su 2015), corresponding to step c in Fig. 2. In this study, the downscaled gene expression profile $[eqn]$ is used to construct a cell–cell SNN graph $[eqn]$ , where the node set V is the set of cells and E is the set of all edges. Given any two cells $[eqn]$ and $[eqn]$ , if the Euclidean distance from $[eqn]$ to $[eqn]$ ranks among the top l shortest distances from $[eqn]$ to other cells (with l set to 20 by default, as determined through hyperparameter tuning experiments; see Section 2 and Chapter 2, available as supplementary data at Bioinformatics online for details), $[eqn]$ is deemed a nearest neighbor of $[eqn]$ . When $[eqn]$ is a nearest neighbor of $[eqn]$ and $[eqn]$ is also a nearest neighbor of $[eqn]$ , we use an edge to connect cells $[eqn]$ and $[eqn]$ . The weight of the edge is calculated using the following formula:

[eqn]

where $[eqn]$ and $[eqn]$ represent the nearest neighbors of cells $[eqn]$ and $[eqn]$ , respectively.
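The construction can be sketched as below. Cells are connected only when each lies in the other's nearest-neighbor set, as described above. The edge weight used here, the Jaccard overlap of the two neighbor sets, is an assumption chosen for illustration because the weight formula is not reproduced in this text. The names `knn_sets` and `snn_graph` are hypothetical.

```python
import math

def knn_sets(points, n_neighbors):
    """For each point, the index set of its n_neighbors nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:n_neighbors]})
    return neighbors

def snn_graph(points, n_neighbors=20):
    """Return {(i, j): weight} for mutually nearest cells i < j."""
    nn = knn_sets(points, n_neighbors)
    edges = {}
    for i in range(len(points)):
        for j in nn[i]:
            if i < j and i in nn[j]:  # mutual nearest neighbors
                shared = len(nn[i] & nn[j])
                union = len(nn[i] | nn[j])
                edges[(i, j)] = shared / union  # assumed Jaccard-style weight
    return edges

if __name__ == "__main__":
    cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(snn_graph(cells, n_neighbors=2))
```

In the toy example the two well-separated groups of three points produce two disconnected triangles, which is the structure the later community detection step exploits.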

2.1.4 Walktrap clustering for cells module

The Walktrap clustering method aims to cluster the SNN graph, corresponding to step d in Fig. 2.

Initially, each cell (vertex) in the SNN graph is treated as an independent community, resulting in a starting partition with N communities (where N is the number of cells). The distance between each pair of communities is then calculated to identify the pair with the smallest distance. The distance is defined as:

[eqn]

where $[eqn]$ and $[eqn]$ denote the probabilities of reaching community $[eqn]$ after t steps of a random walk starting from communities $[eqn]$ and $[eqn]$ , respectively. Here, $[eqn]$ represents the edge weight between community $[eqn]$ and $[eqn]$ in the SNN graph, calculated using the same method as $[eqn]$ during SNN construction, and $[eqn]$ is the total sum of edge weights from community $[eqn]$ to all other communities.
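To make the random-walk distance concrete, the sketch below computes t-step walk probabilities on a weighted graph and the standard Walktrap distance between two vertices, in which squared probability differences are divided by the degree of the target vertex. It is restricted to single vertices (the initial singleton communities), and the walk length `t=4` is an assumed default, not a value stated in the text.

```python
import math

def walk_probs(adj, start, t):
    """Probability of being at each vertex after t random-walk steps from start."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    probs = [0.0] * n
    probs[start] = 1.0
    for _ in range(t):
        nxt = [0.0] * n
        for i in range(n):
            if probs[i] and degree[i]:
                for k in range(n):
                    nxt[k] += probs[i] * adj[i][k] / degree[i]
        probs = nxt
    return probs

def walktrap_distance(adj, i, j, t=4):
    """Degree-normalized Euclidean distance between the two walk distributions."""
    degree = [sum(row) for row in adj]
    pi, pj = walk_probs(adj, i, t), walk_probs(adj, j, t)
    return math.sqrt(sum((a - b) ** 2 / degree[k]
                         for k, (a, b) in enumerate(zip(pi, pj)) if degree[k] > 0))

if __name__ == "__main__":
    # Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
    adj = [[0] * 6 for _ in range(6)]
    for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
        adj[a][b] = adj[b][a] = 1
    print(walktrap_distance(adj, 0, 1), walktrap_distance(adj, 0, 5))
```

Vertices in the same dense region see almost the same part of the graph after a short walk, so their distance is small, while vertices in different regions are far apart.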

After identifying the two closest communities, the modularity of the current community structure is computed. The two communities are then merged, and the modularity after merging is recalculated. If the modularity improves, the merge is retained; otherwise, it is discarded. Modularity quantifies the reduction in total random walk distances across the graph, serving as an indicator of the quality of community divisions. The increase in modularity from merging any two communities $[eqn]$ and $[eqn]$ is given by:

[eqn]

where $[eqn]$ denotes the elements of the adjacency matrix of graph G, $[eqn]$ and $[eqn]$ are the total edge weights associated with nodes i and j, respectively, and w is the total edge weight sum of all nodes. A larger $[eqn]$ value indicates that the merged community structure is of higher quality than before.

This merging process continues iteratively until no further merging results in a significant improvement in modularity score. The number of communities after convergence is taken as the final number of clusters K determined by the Walktrap method.
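The accept-or-reject rule for a merge can be illustrated with the standard Newman modularity of a weighted undirected graph. This is an assumed stand-in for the modularity expression referred to above, and here the gain is computed simply as the modularity after the merge minus the modularity before it. The function names are hypothetical.

```python
def modularity(adj, labels):
    """Newman modularity of a partition.

    adj: symmetric matrix of edge weights; labels: community id per node.
    """
    n = len(adj)
    strength = [sum(row) for row in adj]   # total edge weight at each node
    total = sum(strength)                  # every edge is counted twice
    if total == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - strength[i] * strength[j] / total
    return q / total

def merge_gain(adj, labels, a, b):
    """Change in modularity if community b is merged into community a."""
    merged = [a if c == b else c for c in labels]
    return modularity(adj, merged) - modularity(adj, labels)

if __name__ == "__main__":
    # Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
    adj = [[0] * 6 for _ in range(6)]
    for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
        adj[u][v] = adj[v][u] = 1
    print(merge_gain(adj, [0, 1, 2, 3, 4, 5], 0, 1))   # merging neighbors helps
    print(merge_gain(adj, [0, 0, 0, 1, 1, 1], 0, 1))   # merging the triangles hurts
```

Merging proceeds while the gain is positive; in the toy graph, joining the two triangles into one community lowers modularity, so that merge would be discarded.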

Following the results of Walktrap clustering, we partition the original gene expression matrix in accordance with the Walktrap clustering labels to ensure that all gene information is preserved within each cell cluster. This preparation facilitates the next phase of gene clustering, maintaining the integrity of genetic profiles within each cluster. The partitioned cell clusters set is also the output from this phase, denoted by $[eqn]$ , where r indicates the number of cell clusters formed. It is important to note that each element of $[eqn]$ will subsequently undergo further gene clustering judgment phase and gene clustering phase. For clarity, we use $[eqn]$ as an example in subsequent descriptions, where $[eqn]$ represents the number of cells in $[eqn]$ cluster.
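The partition step amounts to grouping the rows of the original (not downscaled) expression matrix by cluster label while keeping every gene column. A minimal sketch, assuming a cells x genes list of rows and a hypothetical helper name `partition_by_label`:

```python
from collections import defaultdict

def partition_by_label(matrix, labels):
    """Group the rows (cells) of the original matrix by cluster label.

    All gene columns are kept, so each cell cluster retains the full gene set.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return {label: [matrix[i] for i in idxs] for label, idxs in groups.items()}

if __name__ == "__main__":
    expression = [[1, 2, 0], [3, 4, 1], [5, 6, 2]]
    print(partition_by_label(expression, [0, 1, 0]))
```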

2.2 Gene clustering judgment phase

We design a recursive iterative mechanism, similar to hierarchical clustering, that incorporates an “Explicit Logic Gate.” Unlike the passive “compute-first, check-later” strategy employed in previous deep biclustering methods (e.g. scDCABC), where the reverse strategy is indiscriminately applied to all clusters and wastes resources, scDBic’s judgment phase acts as an active gatekeeper. It evaluates the topological necessity of further decomposition before triggering the computationally expensive gene clustering and reverse strategy. For any given $[eqn]$ , scDBic sequentially checks whether it satisfies the gene clustering judgment condition. If no new cluster is formed ( $[eqn]$ ) or the newly generated cluster contains fewer than 6 cells ( $[eqn]$ ), the condition is considered satisfied. If the condition is not met, the algorithm continues iterating until all elements in $[eqn]$ satisfy the condition, at which point the scDBic algorithm terminates. The detailed mathematical description is provided in Chapter 3, available as supplementary data at Bioinformatics online.

The final output of scDBic includes the cell labels from the biclusters and the key bicluster corresponding to each label. The key bicluster is the result of the last iteration of gene clustering before the judgment condition is satisfied.
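One way to read the control flow of Sections 2.1 to 2.3 is the recursion sketched below. It is an interpretation, not the authors' code: `cluster_cells` and `select_key_genes` are hypothetical callables standing in for the cell clustering phase and for gene clustering plus the reverse strategy, and the exact form of the judgment condition is given only in the supplementary material.

```python
def scdbic_recursion(cells, genes, cluster_cells, select_key_genes, min_cells=6):
    """Recursively split cells; return a list of (cells, genes) biclusters."""
    groups = cluster_cells(cells, genes)          # cell clustering phase
    if len(groups) <= 1:                          # gate: no new cluster formed
        return [(cells, genes)]
    leaves = []
    for group in groups:
        if len(group) < min_cells:                # gate: too few cells to split
            leaves.append((group, genes))
        else:
            key_genes = select_key_genes(group, genes)   # gene clustering phase
            leaves.extend(scdbic_recursion(group, key_genes, cluster_cells,
                                           select_key_genes, min_cells))
    return leaves

if __name__ == "__main__":
    # Toy stand-ins: split cells by parity, keep the first half of the genes.
    def by_parity(cells, genes):
        evens = [c for c in cells if c % 2 == 0]
        odds = [c for c in cells if c % 2 == 1]
        return [g for g in (evens, odds) if g]

    def first_half(cells, genes):
        return genes[:max(1, len(genes) // 2)]

    for leaf in scdbic_recursion(list(range(20)), list("abcdefgh"), by_parity, first_half):
        print(leaf)
```

Because the gate is checked before `select_key_genes` is called, the expensive step runs only for clusters that can still be split, and every cell ends up in exactly one leaf.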

2.3 Gene clustering phase

The gene clustering phase corresponds to the “K-means clustering for genes” module and the “Identifying key candidate biclusters via reverse strategy” module, aiming to cluster key genes into the same group and to identify, through a reverse strategy, which group contains the critical gene information. Compared with traditional clustering, the added gene clustering phase can, on the one hand, identify local patterns of gene-cell co-regulation and uncover differential expression of the same gene in different subpopulations; on the other hand, it can eliminate the influence of irrelevant genes on subsequent cell clustering.

2.3.1 K-means clustering for genes

Given a cell cluster, if the judgment condition is not satisfied, this cell cluster proceeds to the gene clustering phase. This phase aims to filter out key genes through clustering, thus enabling more refined sub-cell clusters. Different types of cell clusters can be distinguished using a small subset of genes. These small subsets of genes are called key genes, and identifying them helps identify cell clusters. Therefore, clustering is applied to identify key genes. The gene clustering process is shown in Fig. 3.

Fig. 3. The workflow of gene clustering. Step e represents K-means clustering for genes. The cell clustering phase and step f jointly constitute the reverse strategy for identifying key candidate biclusters.

For a given cell cluster $[eqn]$ , we proceed to perform gene clustering using K-means method to derive k gene clusters (with k set to 5 by default, as determined through hyperparameter selection experiments; see Section 2 and Chapter 2, available as supplementary data at Bioinformatics online for details). These gene clusters can also be considered as biclusters, denoted as $[eqn]$ , corresponding to step e in Fig. 3. Not all biclusters contain key genes. Consequently, it is essential to identify the biclusters that contain key genes.
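Gene clustering treats each gene (a column of the cell cluster's expression matrix) as one data point. The sketch below is a plain Lloyd-style K-means written with the standard library for illustration; the random initialization, the iteration cap, and the handling of empty clusters are assumptions, and a real pipeline would more likely rely on an established K-means implementation.

```python
import math
import random

def kmeans_genes(matrix, k=5, n_iter=50, seed=0):
    """Cluster the columns (genes) of a cells x genes matrix; one label per gene."""
    genes = [list(col) for col in zip(*matrix)]          # one vector per gene
    rng = random.Random(seed)
    centers = [list(g) for g in rng.sample(genes, k)]    # assumed random init
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(g, centers[c]))
                      for g in genes]
        if new_labels == labels:                         # assignments are stable
            break
        labels = new_labels
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:                                  # keep old center if empty
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels

if __name__ == "__main__":
    cluster = [[10, 11, 10, 0, 1, 0]] * 4    # 4 cells, 6 genes
    print(kmeans_genes(cluster, k=2))
```

Each resulting gene group, together with all cells of the cluster, forms one candidate bicluster for the reverse strategy.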

2.3.2 Identifying key candidate biclusters in reverse strategy

We propose a reverse strategy for identifying the key candidate biclusters. Given a bicluster $[eqn]$ , we apply the cell clustering method proposed in Section 2.1, corresponding to the combination of the cell clustering phase and step f in Fig. 3, to obtain s cell groups. The inner distance (denoted as D) of bicluster $[eqn]$ is calculated as follows:

[eqn]

where $[eqn]$ denotes the inner distance of the gth cell group. Given a cell group g, $[eqn]$ represents the number of cells within a cell group, $[eqn]$ represents the number of genes within a cell cluster, and the inner distance (denoted as insim) of cell group g is calculated by the following formula:

[eqn]

where $[eqn]$ and $[eqn]$ represent the expression levels of the gth gene in cell i and cell j, respectively.

The smaller the inner distance of a bicluster, the more important that bicluster is considered. By comparing the inner distances of $[eqn]$ , the key candidate bicluster is identified as the one with the smallest inner bicluster distance, assumed here to be $[eqn]$ . We demonstrate the significance of identifying the key biclusters in the subsequent Section 3.3.
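The selection rule, picking the candidate bicluster whose cell groups are internally most compact, can be sketched as follows. The specific distance used here (mean pairwise Euclidean distance between cells of a group, divided by the number of genes, then averaged over groups) is an assumption made for illustration, since the formulas for D and insim are not reproduced in this text.

```python
import math
from itertools import combinations

def group_inner_distance(rows):
    """Mean pairwise Euclidean distance between the cells of one group, per gene."""
    if len(rows) < 2:
        return 0.0
    pairs = list(combinations(rows, 2))
    return sum(math.dist(a, b) for a, b in pairs) / (len(pairs) * len(rows[0]))

def bicluster_inner_distance(matrix, cell_groups):
    """Average inner distance over the cell groups found inside one bicluster."""
    return sum(group_inner_distance([matrix[i] for i in group])
               for group in cell_groups) / len(cell_groups)

def pick_key_bicluster(biclusters):
    """biclusters: list of (matrix, cell_groups). Returns (best index, scores)."""
    scores = [bicluster_inner_distance(m, groups) for m, groups in biclusters]
    return min(range(len(scores)), key=scores.__getitem__), scores

if __name__ == "__main__":
    tight = ([[0, 0], [0, 1], [5, 5], [5, 6]], [[0, 1], [2, 3]])
    loose = ([[0, 0], [4, 3], [9, 9], [1, 1]], [[0, 1], [2, 3]])
    print(pick_key_bicluster([loose, tight]))
```

In the toy example the gene subset that yields tight cell groups wins, which mirrors the idea that key genes are those on which the cells of a group agree.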

3 Results

3.1 Datasets and evaluation metrics

In this study, we select five well-annotated scRNA-seq datasets: E-MTAB-3321, E-MTAB-2600, GSE45719, GSE52529, and GSE87544. These datasets vary in scale, ranging from hundreds to ten thousand cells, each with tens of thousands of genes. To measure the performance of the algorithms, we adopt the Macro-F1 Score (hereafter referred to as F1), Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI) as evaluation metrics. The detailed formulations of these metrics are provided in Chapter 1, available as supplementary data at Bioinformatics online.
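For readers unfamiliar with these metrics, the sketch below computes the Adjusted Rand Index from its standard pair-counting definition. It is offered as an illustration of what the metric measures; the formulations actually used in the study are those given in the supplementary material.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Standard pair-counting ARI between two labelings of the same cells."""
    n = len(true_labels)
    if n < 2:
        return 1.0
    pair_counts = Counter(zip(true_labels, pred_labels))
    index = sum(comb(v, 2) for v in pair_counts.values())
    sum_true = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_pred = sum(comb(v, 2) for v in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:     # degenerate case, e.g. a single cluster in both
        return 1.0
    return (index - expected) / (max_index - expected)

if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # same partition, renamed
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))   # disagreeing partition
```

ARI depends only on which cells are grouped together, so renaming the clusters does not change the score.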

3.2 Hyperparameter tuning

Based on the results of the hyperparameter tuning experiments, the default settings for scDBic are as follows: the number of training epochs for the autoencoder is set to 100, the optimizer of the autoencoder is RMSprop, the number of nearest neighbors l is 20, and the value of k in the K-means gene clustering is 5. Detailed results of the hyperparameter tuning experiments can be found in Chapter 2, available as supplementary data at Bioinformatics online.

3.3 Deep autoencoder module and identifying key candidate biclusters are two critical steps in clustering scRNA-seq data

As discussed in Section 2.1.2 and Section 2.3.2, the autoencoder mitigates noise by mapping profiles to low-dimensional spaces, while key bicluster identification reveals cluster-specific genes. To validate these components, we compared three variants: “Walktrap” (baseline); “Walktrap + Autoencoder” (iterative autoencoder downscaling without gene clustering); and “Walktrap + Key Biclusters + PCA”. In the third variant, PCA reduces dimensions to 128—matching the autoencoder’s bottleneck—to ensure a fair comparison while utilizing our framework’s gene clustering strategy.

From Table 1, we observe that the scDBic algorithm shows a 3.6%–77.7% improvement over the traditional Walktrap algorithm across the F1, ARI, and NMI metrics, which demonstrates the superiority of the scDBic algorithm.

From the analysis of lines 4, 8, and 12 in Table 1, the “Walktrap + Key Biclusters + PCA” algorithm improves the results by 2.2%–71.9% compared to the traditional Walktrap method. In addition, the primary difference between scDBic and “Walktrap + Key Biclusters + PCA” lies in their dimensionality reduction techniques: a deep autoencoder versus PCA. Among the 12 results from three evaluation metrics across four datasets, scDBic with autoencoder dimensionality reduction outperforms “Walktrap + Key Biclusters + PCA” in 11 results, with improvements ranging from 0.8% to 12.9%. The only exception is the F1 score on the E-MTAB-3321 dataset, where scDBic is lower by 3.1%. Overall, these results underscore the superior adaptability of the autoencoder to scRNA-seq data.

Similarly, according to lines 3, 7, and 11 in Table 1, the “Walktrap + Autoencoder” algorithm improves the results by up to 76.7% compared to the traditional Walktrap algorithm. Furthermore, comparing the scDBic algorithm to the “Walktrap + Autoencoder” algorithm, scDBic includes a step for identifying key candidate biclusters, while “Walktrap + Autoencoder” algorithm does not. In the GSE52529 dataset, “Walktrap + Autoencoder” performs better by 3.7% in ARI and 3.2% in NMI. However, scDBic shows improvements ranging from 1% to 16.5% over “Walktrap + Autoencoder” in the other datasets. These results demonstrate that identifying the key candidate biclusters can help improve the accuracy of the algorithm and highlight the necessity of this step.
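The ranges quoted in this section can be reproduced directly from Table 1. The check below suggests that the reported percentages are absolute differences between scores (multiplied by 100) rather than relative changes. The dictionary simply transcribes Table 1; the helper name `score_differences` is illustrative.

```python
# Table 1 values; dataset order: E-MTAB-3321, GSE45719, GSE52529, E-MTAB-2600.
TABLE1 = {
    "F1": {
        "scDBic": [0.923, 0.856, 0.642, 0.982],
        "Walktrap + Autoencoder": [0.899, 0.822, 0.634, 0.973],
        "Walktrap + Key Biclusters + PCA": [0.954, 0.801, 0.634, 0.947],
        "Walktrap": [0.454, 0.523, 0.504, 0.841],
    },
    "ARI": {
        "scDBic": [0.981, 0.638, 0.288, 0.958],
        "Walktrap + Autoencoder": [0.971, 0.560, 0.325, 0.934],
        "Walktrap + Key Biclusters + PCA": [0.923, 0.511, 0.263, 0.866],
        "Walktrap": [0.204, 0.387, 0.235, 0.639],
    },
    "NMI": {
        "scDBic": [0.950, 0.776, 0.313, 0.925],
        "Walktrap + Autoencoder": [0.928, 0.611, 0.345, 0.883],
        "Walktrap + Key Biclusters + PCA": [0.896, 0.742, 0.299, 0.796],
        "Walktrap": [0.267, 0.611, 0.277, 0.650],
    },
}

def score_differences(method, baseline):
    """Absolute score differences (method minus baseline) over all 12 cells."""
    return [round(a - b, 3)
            for metric in TABLE1.values()
            for a, b in zip(metric[method], metric[baseline])]

if __name__ == "__main__":
    vs_walktrap = score_differences("scDBic", "Walktrap")
    print(min(vs_walktrap), max(vs_walktrap))
    vs_pca = score_differences("scDBic", "Walktrap + Key Biclusters + PCA")
    print(sum(d > 0 for d in vs_pca), min(vs_pca), max(vs_pca))
```

Running the same helper for "Walktrap + Autoencoder" against "Walktrap" gives a maximum difference of 0.767, matching the "up to 76.7%" statement.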

3.4 Comparing the performance of our algorithm with other clustering and biclustering algorithms

We compared scDBic with six state-of-the-art biclustering methods spanning various principles: graph-based [BiSNN-Walk (Shi and Huang 2017)], information-theoretic [QUBIC2 (Xie et al. 2020)], matrix factorization [SSLB (Moran et al. 2021)], statistical [GiniClust3 (Dong and Yuan 2020)], common-subsequence [Runibic (Orzechowski et al. 2018)], and deep learning [scDCABC (Tang and Lan 2024)]. We also evaluated 21 clustering algorithms categorized into: deep learning [ADClust (Zeng et al. 2022), scGNN2.0 (Gu et al. 2022), scSSA (Zhao et al. 2022), AttentionAE-sc (Li et al. 2023), DeepScena (Lei et al. 2023), scDECL (Gan et al. 2023), scDSC (Gan et al. 2022), SCLEGA (Liu et al. 2024)]; graph and spectral methods [Seurat (Hao et al. 2021), ScGSLC (Li et al. 2021a), MPSSC (Park and Zhao 2018)]; consensus and hierarchical clustering [SC3 (Kiselev et al. 2017), SCENA (Cui et al. 2021), scHFC (Wang et al. 2022)]; classical clustering paired with PCA/UMAP; and imputation-assisted methods [GE-Impute (Wu and Zhou 2022) with Seurat]. All methods were implemented using default parameters. A comprehensive summary of all compared methods is presented in Table 2, available as supplementary data at Bioinformatics online.

Three evaluation metrics, F1, ARI, and NMI, are employed to assess the clustering performance of these algorithms on five public datasets. The results of F1 are shown in Fig. 4, while the results and analysis of ARI and NMI are provided in Chapter 4, available as supplementary data at Bioinformatics online.

Fig. 4. F1 scores of our algorithm scDBic and 27 state-of-the-art algorithms on five scRNA-Seq datasets.

In terms of the F1 metric, scDBic achieves the best performance across four of the five datasets. Specifically, on the E-MTAB-3321 dataset, the improvement range of scDBic is between 0% and 61.3%; for the GSE45719 dataset, the improvement range is between 8.8% and 66.8%; for the GSE52529 dataset, the improvement range is between 1.8% and 52.8%; and on the E-MTAB-2600 dataset, the improvement range is between 1.5% and 77.1%. Notably, since nearly half of the competing algorithms were unable to handle the large-scale GSE87544 dataset, to ensure fairness, the average F1 scores were calculated based on the other four datasets. On the GSE87544 dataset, scDBic achieves an F1 score of 0.503, ranking third. From an overall perspective, the average F1 score of scDBic is 0.851, which is significantly higher than that of the other algorithms, exceeding their average F1 scores by 0.064 to 0.569, indicating a notable improvement.
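The reported average can be checked against the scDBic F1 scores listed in Table 1 for the four datasets that enter the average (GSE87544 is excluded, as stated above). This is a plain arithmetic check, with the values transcribed from Table 1.

```python
from statistics import mean

# scDBic F1 scores from Table 1 (GSE87544 is excluded from the average).
scdbic_f1 = {
    "E-MTAB-3321": 0.923,
    "GSE45719": 0.856,
    "GSE52529": 0.642,
    "E-MTAB-2600": 0.982,
}

average_f1 = round(mean(list(scdbic_f1.values())), 3)

if __name__ == "__main__":
    print(average_f1)
```

The four-dataset mean rounds to 0.851, consistent with the figure quoted in the text.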

Overall, the performance of the scDBic algorithm is superior to that of traditional clustering and biclustering algorithms.

3.5 Category distribution analysis

We visualized the clustering results of scDBic and other baseline methods across all datasets involved in the experiments to evaluate the spatial structure of the clustering. For comparison, the analysis focused on two aspects: clustering boundaries and cell integrity. The results show that scDBic yields clearer and more distinct clustering boundaries than other methods. Moreover, scDBic preserves full cell coverage without the cell loss commonly observed in many biclustering methods, thereby maintaining better cell integrity. Detailed results can be found in Chapter 5, available as supplementary data at Bioinformatics online.

3.6 Time complexity analysis of scDBic

The time complexity of scDBic is approximately $[eqn]$ , with detailed derivation provided in Chapter 6, available as supplementary data at Bioinformatics online. The time complexities of all baseline algorithms are summarized in the sixth column of Table 2, available as supplementary data at Bioinformatics online. As shown in the fourth column of Table 2, available as supplementary data at Bioinformatics online, scDBic ranks at a middle level among all clustering and biclustering methods in terms of runtime and completes the biclustering task within an acceptable timeframe. Combined with its strong accuracy and robustness across multiple datasets, scDBic demonstrates a well-balanced performance in terms of efficiency, stability, and practical applicability.

3.7 The biological significance of biclusters

The biclusters identified by the scDBic algorithm not only reveal the cell groups of the tissue but also identify the key genes of those cell groups. Here, we analyze E-MTAB-3321 as an example. The analysis of KEGG pathways for the GSE45719, GSE52529, and E-MTAB-2600 datasets is presented in Chapter 7, available as supplementary data at Bioinformatics online.

The E-MTAB-3321 dataset comprises 124 mouse embryonic cells and 41 480 genes (Goolam et al. 2016). It involves isolation of individual cells from consecutive developmental stages of pre-implantation mouse embryos and sequencing of their transcriptomes using the Smart-seq2 single-cell RNA sequencing protocol across five stages: 2-cell, 4-cell, 8-cell, 16-cell, and 32-cell stages.

Goolam et al. (2016) shed light on the significant impact of gene expression heterogeneity on cell fate during the 4-cell stage of mouse embryonic development. They observed that the Sox21 gene displayed proximal binding to Oct4 and regulated the expression of the differentiation master regulator Cdx2, exhibiting the strongest heterogeneity during the 4-cell stage. This suggests a pivotal role of the Sox21 gene in orchestrating cellular differentiation during the early stages of embryonic development. Further studies reveal that the expression of Sox21 is regulated by CARM1, which can methylate arginine 26 of histone H3 (H3R26). Therefore, the CARM1 gene is also a key gene at the 4-cell stage.

Both the Sox21 and CARM1 genes appear in the representative 4-cell bicluster generated by the scDBic algorithm. Additionally, KEGG pathway enrichment analysis of this bicluster (Fig. 5) shows that the top three significantly enriched pathways—“Neuroactive ligand-receptor interaction,” “Cytokine-cytokine receptor interaction,” and “MAPK signaling pathway”—are closely related to mouse embryonic development. They jointly regulate critical processes such as cell growth, immune response, and fate determination. These results confirm that scDBic successfully identifies key genes defining specific cell subpopulations.

Fig. 5. KEGG pathway enrichment results of biclusters identified by scDBic in the E-MTAB-3321 dataset.

4 Discussion

In this study, we propose scDBic, a specialized deep learning-based biclustering algorithm tailored for scRNA-seq data analysis. Conceptually, scDBic optimizes the deep learning component by employing a streamlined autoencoder architecture. Compared to complex probabilistic models, our design avoids heavy parameter estimation while effectively extracting key gene expression features. This optimization, combined with the active logic gate and reverse selection strategy, enables scDBic to accurately identify representative biclusters. Experimental results across diverse datasets demonstrate that scDBic outperforms six biclustering and twenty-one clustering methods. While our recursive strategy entails a moderate computational cost compared to simple matrix factorization, it offers a superior trade-off by ensuring high clustering accuracy and biological interpretability. Looking ahead, we acknowledge the emergence of Single-cell Large Language Models (scLLMs) which offer powerful universal representations. Future work will explore integrating scLLM-derived embeddings into the scDBic framework to leverage the strengths of both paradigms for uncovering intrinsic patterns in massive transcriptomic datasets.

Supplementary Material

btag095_Supplementary_Data

Bibliography

1. Bandyopadhyay S, Mallik S. Integrating multiple data sources for combinatorial marker discovery: a study in tumorigenesis. IEEE/ACM Trans Comput Biol Bioinform 2018;15:673–87. PMID: 28114033. doi:10.1109/TCBB.2016.2636207
2. Celiker H, Gore J. Cellular cooperation: insights from microbes. Trends Cell Biol 2013;23:9–15. PMID: 22999189. doi:10.1016/j.tcb.2012.08.010
3. Cui Y, Zhang S, Liang Y et al. Consensus clustering of single-cell RNA-seq data by enhancing network affinity. Brief Bioinform 2021;22:bbab236. PMID: 34160582. doi:10.1093/bib/bbab236. PMCID: PMC8574980
4. Dong R, Yuan G-C. GiniClust3: a fast and memory-efficient tool for rare cell type identification. BMC Bioinformatics 2020;21:158. PMID: 32334526. doi:10.1186/s12859-020-3482-1. PMCID: PMC7183612
5. Eberhardt CS, Kissick HT, Patel MR et al. Functional HPV-specific PD-1+ stem-like CD8 T cells in head and neck cancer. Nature 2021;597:279–84. PMID: 34471285. doi:10.1038/s41586-021-03862-z. PMCID: PMC10201342
6. Gan Y, Chen Y, Xu G et al. Deep enhanced constraint clustering based on contrastive learning for scRNA-seq data. Brief Bioinform 2023;24:bbad222. PMID: 37313714. doi:10.1093/bib/bbad222
7. Gan Y, Huang X, Zou G et al. Deep structural clustering for single-cell RNA-seq data jointly through autoencoder and graph neural network. Brief Bioinform 2022;23:bbac018. PMID: 35172334. doi:10.1093/bib/bbac018
8. Goolam M, Scialdone A, Graham SJ et al. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell 2016;165:61–74. PMID: 27015307. doi:10.1016/j.cell.2016.01.047. PMCID: PMC4819611