GeneSurfer enables transcriptome-wide exploration and annotation of gene co-expression modules in 3D spatial transcriptomics data
Chang Li, Julian Thijssen, Thomas Kroes, Ximaine van der Burg, Louise van der Weerd, Thomas Höllt, Boudewijn Lelieveldt

TL;DR
GeneSurfer is a tool that helps researchers explore gene co-expression patterns in 3D spatial transcriptomics data, offering insights into biological functions and supporting hypothesis generation.
Contribution
GeneSurfer introduces an interactive interface for localized transcriptome-wide gene co-expression analysis in 3D spatial transcriptomics data.
Findings
GeneSurfer identifies and annotates localized co-expression patterns in spatial transcriptomics data.
The tool supports multi-slice 3D rendering and on-the-fly Gene Ontology term annotation of gene clusters.
It is demonstrated to be effective using data from the Allen Brain Cell Atlas.
Abstract
Gene co-expression provides crucial insights into biological functions, however, there is a lack of exploratory analysis tools for localized gene co-expression in large-scale datasets. We present GeneSurfer, an interactive interface designed to explore localized transcriptome-wide gene co-expression patterns in the 3D spatial domain. Key features of GeneSurfer include transcriptome-wide gene filtering and gene clustering based on spatial local co-expression within transcriptomically similar cells, multi-slice 3D rendering of average expression of gene clusters, and on-the-fly Gene Ontology term annotation of co-expressed gene sets. Additionally, GeneSurfer offers multiple linked views for investigating individual genes or gene co-expression in the spatial domain at each exploration stage. Demonstrating its utility with both spatially resolved transcriptomics and single-cell RNA…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSingle-cell and spatial transcriptomics · Cell Image Analysis Techniques · Gene Regulatory Network Analysis
Introduction
The rapid emergence of single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) has led to routine generation of high-dimensional, spatially resolved datasets at the whole-organ scale.1^,^2^,^3 These datasets offer high informational content but require scalable computational methods for effective data exploration.
From spatially sampled bulk sequencing we already know that spatially co-varying genes tend to be functionally related.4 The “guilt by association” principle suggests that genes which are co-expressed are likely to share function roles, especially within transcriptomically similar cells.
Methods for grouping genes into co-expression modules include clustering, matrix decomposition, network inference-based approaches, and others.5 The most common approach is to employ clustering algorithms for gene set identification. However, one major limitation is that clustering methods group genes using the entire dataset as input. It might overlook spatially localized co-expression patterns within specific cell subsets, since genes can be locally co-expressed at certain locations in the brain but not at others. Thus, methods that can detect spatially localized co-expression could provide biological specificity of co-expression patterns. Among the various module detection methods, biclustering is able to go beyond this limitation by simultaneously capturing gene co-expression and sample subsets across the data matrix.6^,^7 But similar to other module detection methods, biclustering algorithms are script-based and require predefined parameters.5 This introduces uncertainty and leads to the need for testing across different settings. Therefore, there still exists a gap to identify the local gene co-expression in a flexible and interactive manner.
Besides methods that directly identify local gene co-expression, we can also apply the clustering algorithm to a subset of a specific cell type, to compute co-expression within transcriptomically similar cells. However, cell clustering algorithms divide cells into disjoint classes according to transcriptional similarity, binarizing potentially gradual cell type transitions and gradients into distinct subtypes.3 This process requires the user to decide beforehand which level of cluster resolution is of interest, limiting the ability to explore the complete parameter space. Currently, there is a lack of flexible approaches that allow for the identification of local spatial co-expression at user-defined cell-type resolution, offering a dynamic and interactive approach to cell transcriptome similarities.
Subsequent steps of co-expression identification usually include visualization and functional interpretation. Gene set enrichment analysis allows us to derive molecular insights with interpretable terms, such as known biological pathways and functions.8 Enrichment frameworks such as ToppGene9 and gProfiler10 offer application programming interface (API) access to integrate enrichment analysis into analytical tools.
SRT techniques measure gene expression information of tens to hundreds of genes in individual cells, while preserving their spatial context in the tissue.2 This enables the visualization of gene expression in the spatial domain. Despite significant improvements in the transcriptome resolution in recent years,2 a major challenge of imaging-based SRT is the limited number of gene detection due to constraints on the number of probes. In contrast, scRNA-seq aims to capture a detailed transcriptomic profile, but does not provide spatial context. Integrative tools have been developed to facilitate the visualization of gene expression in SRT data or to represent scRNA-seq data in reduced dimensions,11^,^12^,^13 such as through t-distributed stochastic neighbor embedding (t-SNE)14 and uniform manifold approximation and projection (UMAP).15 Additionally, computational imputation methods and machine learning models are actively being researched to integrate scRNA-seq data to expand the gene dimensionality of SRT data.16^,^17^,^18^,^19 However, interactive visualization software for visualizing and exploring large-scale scRNA-seq data, particularly in the 3D spatial domain, are still lacking.
In this work, we introduce GeneSurfer: an interactive interface for exploring localized transcriptome-wide gene co-expression patterns in the 3D spatial domain. The major contributions of our work are: (1) transcriptome-wide gene filtering and gene clustering based on spatial co-expression within transcriptomically similar cell populations, (2) real-time, interactive cell selection and multi-slice 3D rendering of average expression of gene clusters, and (3) on-the-fly Gene Ontology (GO) term annotation of co-expressed gene sets. In addition, we offer multiple linked views for interactive visualization of individual genes or gene co-expression in the spatial domain at each stage of the exploration. We demonstrate the utility of GeneSurfer using both multiplexed error-robust fluorescence in situ hybridization (MERFISH) and scRNA-seq data from the Allen Brain Cell (ABC) Atlas.1
Results
GeneSurfer overview
By offering multiple linked views of cells and genes, we aim to provide the user with an interactive interface to explore local gene expression patterns in the spatial domain and to gain potential biological insights from the identified spatial patterns (Video S1). We support the input of SRT datasets and, additionally, we also project the averages of transcriptome-wide gene expression obtained in scRNA-seq data onto the spatial domain by matching annotations (Figure 1B). This requires that both the SRT and the scRNA-seq are classified in the exact same cell type taxonomy.Figure 1. GeneSurfer is an interactive interface for exploring local gene expression patterns in the spatial domain(A) Overview of the proposed workflow. Users select transcriptomically similar cells and visualize them in 3D. Genes are filtered and then clustered based on their expression within the selected cells. GO terms are retrieved by selecting a gene set cluster, and individual genes can be visualized in 2D and 3D.(B) scRNA-seq expression averages of the annotations are projected in the spatial domain, allowing visual exploration of transcriptome-wide gene expression in 3D space.(C) Users can interactively select transcriptomically similar cells using SpaceWalker, annotations from the data taxonomy, or through HSNE or t-SNE embeddings.(D) After selecting a subset of cells, genes are filtered based on their local expression within the selected cells. Three metrics are available for gene filtering: Moran’s I, Pearson correlation between gene expression and corresponding spatial coordinates and differential expression between the selection and the whole data. Filtered genes are displayed in a bar chart, and are clustered into groups. Associated GO terms of the selected gene cluster are retrieved and displayed in a table. Views in GeneSurfer are linked to enable an interactive manner of exploration.
Video S1. Demonstration of interactive exploration with GeneSurfer using ABC Atlas data, related to Figure 1. The video showcases how users can select cells of interest, filter and cluster genes, retrieve associated GO terms and inspect gene expressions, for an interactive and dynamic analysis experience.
An overview of the proposed methodology is shown in Figure 1A. Our approach begins with a cell-centric exploration, where users select cells of interest that have similar gene expression profiles through one of the three selection methods (Figure 1C). First, users can select a cell type annotation from the annotated data taxonomy. Second, users can use our previous work SpaceWalker20 to interactively identify cells with transcriptomically similar profiles at user-defined cell locations. Third, users can re-embed the data on-the-fly using Hierarchical Stochastic Neighbor Embedding (HSNE)21 or t-SNE14 and select cell clusters from the embedding. As such, this initial step focuses on selecting cell populations of interest for further analysis.
Once similar cells are selected, this workflow instantaneously transitions to gene-centric exploration (Figure 1D), allowing for on-the-fly analysis of genes within the selected cells. Prior to gene set clustering, gene filtering is performed based on user-defined spatial metrics within the cell selection (Moran’s I statistics22 or the correlation of gene expression and coordinate axes in a standardized Common Coordinate Frame (CCF)). In addition, a non-spatial gene filter can be applied to identify genes highly expressed in the selected cells.
Based on the expression profiles of the selected cells, genes are grouped into clusters with similar local spatial patterns using hierarchical clustering.23 The average expression of genes in each cluster is shown in subviews to provide an overview of the expression patterns of the gene clusters. Gene set enrichment analysis is performed instantaneously on the user-selected gene cluster to provide insights into the potential biological functions or pathways associated with the genes. The output of enrichment analysis is presented in a table listing the statistically significant GO term annotations, their corresponding p-values, and associated gene symbols.
Overall, GeneSurfer allows a closed-loop, interactive exploration of gene expression in a user-defined subset of cells.
Projection of scRNA-seq annotation averages enables visual exploration of transcriptome-wide gene expression in 3D space
To evaluate the projection of annotation averages from scRNA-seq data into the spatial domain based on label correspondence, we used the ABC Atlas.1 The ABC Atlas is a mouse whole-brain transcriptomic cell type atlas, consisting of an annotated scRNA-seq dataset with approximately 4.0 million cells and over 30k genes and an annotated 3D SRT dataset with approximately 3.9 million cells and 500 genes using multiplexed error-robust fluorescence in situ hybridization (MERFISH). The fine-grained cell-type annotations in the ABC Atlas make it an appropriate benchmark dataset for validating our tool, while demonstrating the scalability and discovery potential of the proposed approach. The annotation level “Cluster” (5,322 clusters) is used for the projection.
To validate the 3D spatial projection of scRNA-seq annotation averages, we performed two quantitative experiments, comparing the spatial gene expression patterns of the 500 genes that are measured in both the MERFISH and scRNA-seq data. First, to establish the validity of replacing expression in the 3D spatial data with expression averages of the corresponding annotation from scRNA-seq data, we calculated the Pearson correlation at the cluster annotation level between the scRNA-seq annotation averages and the measured MERFISH data for each gene available in the MERFISH data. Figure 2A shows that the mean expression levels of genes in the clusters are highly correlated between the scRNA-seq clusters and the MERFISH clusters. Second, we calculated the spatial Pearson correlation between the 3D projection of the scRNA-seq annotation averages and the actual MERFISH measurement at the cell level in the 3D spatial domain. Figure 2B shows that the spatial correlations at the cell level are lower than those at the cell cluster level. Thus, we further inspected the individual gene expression by visually comparing the spatial patterns of scRNA-seq annotation averages and the MERFISH measurement. Several genes exhibit high correlation, particularly in genes with widespread and distinctive expression in the whole brain (e.g., Cbln1 in Figure 2C). The relatively low correlations might be due to the fact that the correlation at the cell level is calculated for the whole brain (∼4 million cells). A gene that expresses in the whole brain in a scattered way or a gene that only expresses highly in a small spatially localized hotspot would result in a low correlation coefficient, for example, Lsp1 and Foxa2 (Figure 2C). However, the main structure of the gene expression is preserved by label correspondence of scRNA-seq annotation averages, despite some instances of low correlation values.Figure 2. Comparison of projection of scRNA-seq annotation averages with the MERFISH measurement and Allen in situ hybridization data(A) The distributions of Pearson correlation coefficients between MERFISH measurement and scRNA-seq annotation averages in cell clusters.(B) The distributions of spatial Pearson correlation coefficients between MERFISH measurement and projection of scRNA-seq annotation averages across all cells.(C) Examples of spatial gene expression patterns and their spatial correlations of MERFISH measurement and projection of scRNA-seq annotation averages.(D) Examples of spatial gene expression patterns from Allen Brain Atlas in situ hybridization data and projection of scRNA-seq annotation averages. The Allen Brain Atlas in situ hybridization data are taken from https://mouse.brain-map.org/.4 In the 2D views of expression levels, a red-to-blue colormap is used, where red indicates high expression, light blue represents low expression, and cells with no expression are shown in dark blue. In the 3D views of expression levels, red represents high expression, light yellow indicates lower expression, and points with no expression are shown in transparent gray.
To evaluate the projection of scRNA-seq annotation averages for the genes not measured in the MERFISH, we compared the spatial gene expression between projection of the scRNA-seq annotation averages and the Allen Brain Atlas in situ hybridization data.4 Projection of the scRNA-seq annotation averages show excellent agreement with the Allen Brain Atlas in situ hybridization data (Figure 2D).
We also performed a comparison between the imputed expression matrix for 8,460 marker genes from the Allen Institute and our cluster average projection. The results show a strong correlation between the imputed expression and our cluster average projection, with an average Pearson correlation coefficient of 0.90 across all cells (Figure S1). This implies that the accuracy loss by projecting average expression from a fine-grained cellular taxonomy is limited, while providing significant advantages in computation speed and memory usage.
GeneSurfer identifies spatial co-expression within user-defined cell populations
GeneSurfer aims to identify genes with distinct spatial co-expression within interactively selected transcriptomically similar cell populations. To validate the gene filtering based on spatial expression, and the subsequent gene clustering, we next investigated whether GeneSurfer can identify distinct local spatial co-expression using different gene filtering options.
GeneSurfer provides three gene filtering options to enhance the further gene-centric exploration within selected cells: (1) correlation with a spatial coordinate axis in the CCF to identify genes exhibiting a spatial gradient along the anatomical brain axes, (2) Moran’s I22 to detect genes with non-random spatial expression patterns within the cell selection, and (3) differential expression between foreground and background cells to identify genes highly expressed in the selection. Filtered genes are clustered into groups, revealing potentially distinct spatial co-expression across the brain anatomy. Details on these gene filtering metrics and gene clustering are described in the STAR Methods section.
Figure 3 provides visual examples of the gene filtering and clustering results obtained with GeneSurfer using the projection of scRNA-seq annotation averages. The spatial correlation filters on the anterior-posterior or dorsal-ventral axis identify genes with spatial gradients along an anatomical axis in the CCF. We first selected the IT-ET Glut class from the data taxonomy to explore the spatial patterns within the excitatory neurons in the cortex. Genes with local gradients on the dorsal-ventral axis were filtered out and were clustered into two groups with distinct gradients: dorsal-ventral gradient and ventral-dorsal gradient (Figure 3A). Then, we drilled into L5 IT, L6 IT, and L5/6 IT Glut subclasses to explore the potential anterior-posterior gradients at a more refined cell-type level: excitatory neurons in the deeper cortical layers. Genes that exhibit an anterior-posterior gradient or posterior-anterior gradient within L5/L6 IT neurons were identified and subsequently clustered into two gene clusters (Figure 3B). The Moran’s I filter highlights genes with patterns that significantly deviate from randomness, suggesting potential functions within the spatial context. Figure 3C shows the two different expression patterns identified within the OB-IMN GABA class, where genes are filtered by the Moran’s I statistic. Genes in cluster 1 show a high expression in the subgranular zone, subventricular zone and rostral migratory stream (SGZ, SVZ, and RMS), which might be related to neuron differentiation and cell proliferation,1 including marker genes such as Prox1. Genes in cluster 2 are highly expressed in the olfactory bulb, with decreasing expression toward the posterior aspect of the brain, which might be related to mature olfactory bulb neurons.Figure 3. Example results of gene filtering using different metrics and subsequent gene clustering(A) Genes filtered by spatial correlation on the dorsal-ventral axis within IT-ET Glut neurons. Filtered genes were clustered into two groups: genes with positive spatial correlation show a superior-inferior gradient, with the reverse for genes with negative spatial correlation.(B) Genes filtered by spatial correlation on the anterior-posterior axis within L5 IT, L6 IT and L5/6 IT neurons. Filtered genes were clustered into two groups: genes with a high positive correlation display an anterior-posterior gradient, with the reverse for genes with negative spatial correlation.(C) Genes filtered using Moran’s I within OB-IMN GABA class exhibit non-random expression patterns and are clustered into two groups. The first column in (A), (B), and (C) displays average expression of genes in each cluster. The gene expression patterns shown are within the selected subset and are color-coded based on normalized values within the subset. Red indicates high expression, while light yellow represents lower expression. Points outside the selected subset are rendered in transparent gray to indicate the full 3D brain. (Parameters: 100 genes, 2 clusters).
These examples demonstrate GeneSurfer’s utility in filtering genes with spatial expression and clustering genes to identify local co-expression, and also guide users in selecting the most appropriate gene filtering metrics based on their specific research questions.
The differential expression filter is used to identify genes that are predominantly expressed in the selection. This is designed to explore cell-type specific gene expression which can subsequently be inspected in the spatial domain, which is demonstrated in the next section.
Interactive exploration with GeneSurfer confirms known biological function
In this section, we demonstrate the interactive exploration workflow with GeneSurfer on the whole transcriptome by validating its findings against known biology.
To evaluate the exploration workflow of GeneSurfer together with SpaceWalker, we demonstrate the workflow by exploring cells on a 2D brain slice. GO term annotations were then retrieved for these cell populations and compared to known biological pathways for these cells. When a seed cell location was selected in SpaceWalker, transcriptomically similar cells were selected by flood-fill20 on the fly. Subsequently in GeneSurfer, 50 genes were filtered out employing the differential expression filter that compares gene expression between selection and background cells. These genes were then clustered into two groups based on their local expression. GO term annotations were then instantaneously retrieved for gene function annotations. Figure 4 presents the results of selecting cells at four spatial locations that are close to each other spatially but transcriptomically and functionally distinct, annotated as subclass astrocytes (Astro-TE NN), oligodendrocytes (Oligo NN), pericytes (Peri NN), and microglia (Microglia NN). Genes associated with corresponding functions were identified and associated GO terms were retrieved, demonstrating that SpaceWalker identified transcriptomically similar cells and GeneSurfer discovered differentially expressed genes within this selected subset of cells, reflecting the biological function of these cells. These findings confirm the capability of GeneSurfer with SpaceWalker to analyze cellular components and gene expression in an explorative manner.Figure 4. Transcriptome-wide exploration using SpaceWalker and GeneSurfer retrieved cell-type specific GO term annotationsHovering over the 2D mouse brain section allows for selecting different cells at close-by locations (annotated as Astro-TE NN, Oligo NN, Peri NN and Microglia NN). Transcriptomically similar cells are identified by SpaceWalker and genes are filtered by the differential expression filter within the transcriptomically similar cells. For each selected cell location, GO terms associated with the filtered genes are retrieved, with the figure presenting the five GO terms having the lowest p-values. scRNA-seq annotation averages serve as input for gene filtering and clustering. GO terms for scRNA-seq genes are retrieved by ToppGene without a background gene set. All p-values in this paper have been adjusted using the Bonferroni correction. (GeneSurfer: differential expression filter, 50 genes, 2 clusters; SpaceWalker: flood nodes = 10, flood steps = 10).
It is important to note that cell annotations were only used to validate the results with the known annotation, but the methodology does not use these prior annotations in the gene-centric process and annotations are not required for this workflow. Annotations were only used to project scRNA-seq data to the spatial domain, not in the gene-centric exploration with selecting cells with SpaceWalker. Thus, SRT data without annotations can also be explored using this workflow.
In addition to selecting transcriptomically similar cells through SpaceWalker and the annotation taxonomy, as GeneSurfer is built on top of the ManiVault API,24 GeneSurfer seamlessly integrates with other ManiVault plugins to extend functionality, such as on-the-fly dimensionality reduction. Users can re-embed the whole dataset or a subset of the data, color the embedding using existing annotations, and select cells directly from the embedding to further explain the dataset in the gene-centric workflow.
Although this Genesurfer workflow is extremely powerful, the gene-centric clustering may be affected by the artifacts in the preprocessing of the MERFISH data. We show here an example of endothelial cells, which are annotated as three distinct clusters in the ABC Atlas.1 The re-embedded HSNE map using the MERFISH data reveals additional substructures within the Endo NN_1 cluster (Figure 5A). Selecting points of interest in the HSNE embedding, we observed spatially distinct structures within the Endo NN_1 cluster (Figure 5B). To further investigate whether this suggests additional cellular heterogeneity within the Endo NN_1 cluster in the MERFISH data, we performed differential expression analysis between these substructures and the Endo NN_1 cluster. We then analyzed the scRNA-seq data to inspect cell-specificity of the top differentially expressed genes. We found that each substructure was characterized by a small set of differentially expressed genes that were in most cases expressed in other cell types surrounding endothelial cells, but with region-specific abundance. Based on these observations, we hypothesize that the observed substructures in the spatial data could be caused by a spillover effect of gene transcripts between cell types during the segmentation of (small) endothelial cells. Thus, GeneSurfer enables users to examine the annotated data, and interactively differentiate potential cellular heterogeneity from underlying artifacts inherently present in the data.Figure 5. Demonstration of exploring endothelial cells with HSNE re-embedding(A) HSNE embedding of endothelial cells from the MERFISH data, colored by cluster annotation.1 Substructures in the embedding map are circled and numbered on the map.(B) The corresponding substructures marked in (A) are visualized in the 3D spatial domain, displaying distinct spatial structures. Inspection of the top differentially expressed genes and comparison with the scRNA-seq data lead to the conclusion that these sub-clusters are most likely an artifact, due to spillover from surrounding cells. The top 20 differentially expressed genes for each substructure are listed in Table S1.
Discussion
Recent advances in transcriptomics have led to the need for computational tools to explore large-scale datasets. Computational approaches have been developed to identify co-expressed genes, indicating their functional roles. However, current methods fall short in providing an exploratory tool that can interactively identify spatial co-expression, particularly in a local cellular subspace.
Here, we present GeneSurfer, an interactive interface to explore localized transcriptome-wide gene co-expression in the 3D spatial domain. It allows users to dynamically explore local co-expressed genes within the user-selected cells and on-the-fly GO term annotation of these co-expressed gene clusters. We also support an approximate visualization of scRNA-seq imputed gene expression onto the spatial domain, enabling the user to interactively explore both SRT and scRNA-seq gene expression in the 3D spatial domain. In addition, we offer multiple linked views for interactively investigating individual genes or gene co-expression in the spatial domain at each stage of the exploration. By integrating multiple ways to select cells of interest, three gene filtering options and various user-defined clustering and visualization parameters, GeneSurfer enables user-tailored exploration paths.
GeneSurfer has novel capabilities in comparison to existing approaches. Compared with existing algorithms for identifying co-expression modules in the whole data,5 we specifically target localized spatial co-expression within transcriptomically similar cells in an interactive manner. Second, unlike computational methods to integrate scRNA-seq with SRT data,16^,^17^,^18^,^19 which primarily focus on data integration, our tool extends the functionality to provide transcriptome-wide gene analysis and visualization in the 3D spatial domain by matching annotations. Moreover, while some integrative toolboxes offer interactive visualization,11^,^12^,^13 they do not support interactive and localized gene-centric on-the-fly computational analysis as our tool does. While our previous work SpaceWalker20 focuses on cell-centric exploration and expression gradient detection, GeneSurfer focuses on a gene-centric analysis approach to explore transcriptome-wide gene co-expression within selections of similar cells. Upon selecting cells, GeneSurfer performs the analysis on-the-fly, providing a responsive and interactive experience on a mid-range PC workstation.
We validated the functionality of GeneSurfer using the 3D whole-brain mouse dataset ABC Atlas. We demonstrated that GeneSurfer is able to identify and annotate local gene co-expression within selected cells, providing biological insights in real-time during exploration. We used the projection of annotation averages from scRNA-seq data onto the spatial domain to enable transcriptome-wide exploration, demonstrating its validity when both scRNA-seq and SRT datasets are annotated and available. Additionally, GeneSurfer can work with unannotated SRT datasets to help with hypothesis generation, and with the design of targeted validation experiments. It is compatible with both 3D and 2D datasets.
Future directions include extending the gene information retrieval capabilities, for example, dynamically retrieving detailed gene annotations from external databases during interactive exploration. Such functionality would enhance the biological interpretation and understanding of gene co-expression patterns. In addition, exploring and comparing alternative gene clustering algorithms could provide valuable insights. Specifically, methods that allow genes to belong to multiple, overlapping clusters would be of interest, as they may better reflect the biological complexity of gene co-expression patterns. Furthermore, while our projection approach has been validated on the MERFISH-measured genes (Figure 2) and Allen Institute imputed genes (Figure S1), future work could explore its performance on less-curated gene panels.
Overall, GeneSurfer offers an interactive visual interface for exploring local gene co-expression and their functional annotation.
Limitations of the study
To project transcriptome-wide gene expression onto the spatial domain, GeneSurfer requires that both the SRT and the scRNA-seq are classified in the same cell type taxonomy. This requirement applies specifically for visualizing the scRNA-seq in the spatial domain so that users can still fully explore the local gene patterns in the SRT dataset independently from scRNA-seq data. Furthermore, to address potential mismatches in cell-typing granularity, which is beyond the scope of GeneSurfer, users can consider supervised cell type annotation25^,^26 or utilizing resources like Cell Ontology27 to harmonize cell types between scRNA-seq and spatial data.
A major limitation of this projection approach is the assumption that the average expression of each annotation class is a sufficient representative for that annotation class. While this assumption is reasonable for datasets with fine-grained annotations, such as the ABC Atlas with 5,322 annotated clusters, it may lead to an oversimplification for datasets with coarser annotations. In such cases, this projection approach may smooth out gene expression heterogeneity within cell types. This can potentially lead to overlook of pathways within a certain annotated cell type in the further exploration. The projection and visualization of transcriptome-wide gene expression is specifically designed for datasets with fine-grained annotations. However, if such detailed annotations are not available, GeneSurfer remains fully functional for exploring SRT data without any annotations.
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Chang Li ([email protected]).
Materials availability
This study did not generate new unique reagents.
Data and code availability
- •This paper analyzes existing, publicly available data. These datasets are listed in the key resources table.
- •GeneSurfer is implemented in C++ as a plugin of the ManiVault plugin system24 for visual analytics application building. It can be downloaded, installed, and run as a standalone application. All original code, along with a Windows and Linux installer and system state files containing the data, plugins, and GUI configurations, is publicly available at https://github.com/ManiVaultStudio/GeneSurfer. Installers for MacOS and Linux will be made available at https://www.manivault.studio in the near future.
- •Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Acknowledgments
This work received financial support from 10.13039/501100003246NWO Gravitation project BRAINSCAPES: A Roadmap from Neurogenetics to Neurobiology (NWO: 024.004.012); the NIH Brain Initiative Cell Atlas Network BICAN (HMBA: UM1MH130981); 10.13039/501100003246NWO TTW project 3DOMICS (NWO: 17126) and Alzheimer Netherlands project WE.03-2022-15.
Author contributions
C.L., T.H., and B.L. conceptualized the project. C.L., J.T., and T.K. implemented the software. C.L., X.v.d.B., L.v.d.W., and B.L. performed the analyses and interpreted the data. C.L. and B.L. wrote the manuscript with assistance and approval from all authors.
Declaration of interests
The authors declare that they have no competing interests.
STAR★Methods
Key resources table
REAGENT or RESOURCESOURCEIDENTIFIERDeposited dataAllen Brain Cell AtlasYao et al.1https://github.com/AllenInstitute/abc_atlas_access/blob/main/descriptions/MERFISH-C57BL6J-638850.mdDeveloping mouse brain datasetLa Manno et al.28http://mousebrain.org/development/downloads.htmlMouse organogenesis datasetLohoff et al.29https://marionilab.cruk.cam.ac.uk/SpatialMouseAtlas/Mouse embryo datasetSampath Kumar et al.31https://cellxgene.cziscience.com/collections/d74b6979-efba-47cd-990a-9d80ccf29055Software and algorithmsGeneSurferThis paperhttps://github.com/ManiVaultStudio/GeneSurfer
Method details
Data
The input data consists of a cell-by-gene expression matrix with spatial coordinates, optionally accompanied by metadata. GeneSurfer currently supports loading datasets in the h5ad and h5 file formats. Besides the input of SRT datasets, GeneSurfer optionally supports the projection of transcriptome-wide gene expression averages from scRNA-seq data onto the spatial domain. As described in the results section, projection of scRNA-seq data requires matching cell type annotations between the two datasets.
ABC Atlas data was downloaded from https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html. We used the MERFISH data, imputed MERFISH data and scRNA-seq data for the analysis in this paper.
Developing mouse brain data was downloaded from the Developing Mouse Brain Atlas28 http://mousebrain.org/development/downloads.html. SeqFISH mouse organogenesis data29 was downloaded from https://marionilab.cruk.cam.ac.uk/SpatialMouseAtlas/. Both the developing mouse brain data and the mouse organogenesis data were imputed and the annotations were transferred from scRNA-seq to the spatial SIRV data by Abdelaal et al.30
Slide-seqV2 mouse embryo data31 (E9.5: 201112_04) was downloaded from https://cellxgene.cziscience.com/collections/d74b6979-efba-47cd-990a-9d80ccf29055.
Projecting scRNA data to spatial domain
To visualize scRNA-seq data in the spatial domain, we utilized the matching cell annotations in the SRT and scRNA-seq data. After loading both SRT and scRNA-seq datasets, the user selects an annotation dataset to guide the alignment of the two datasets. The system checks whether both the SRT and scRNA-seq datasets contain the required metadata for the selected annotation. The average expression level is then calculated for each gene of each annotation category in the scRNA-seq data. These calculated average expressions are assigned to the spatial coordinates in the SRT data based on their corresponding annotations, replacing the actual measured MERFISH expressions at that location. In this way, we get an approximate gene expression matrix from scRNA-seq data with spatial coordinates, allowing for visualizing and analyzing the projected scRNA-seq gene sets in 2D and 3D spatial domain. In this manuscript, the annotation level “Cluster” is used for the ABC Atlas.
The prior condition is that an annotated SRT dataset and an annotated scRNA-seq dataset of the same region are required, where the latter covers major annotation types in the SRT dataset. We assume that the annotations in the datasets are fine-grained enough that the average expression of each annotation class is a representative approximation for that annotation class.
Since some of the scRNA-seq datasets can be prohibitively large, ingesting all of the data in memory to compute annotation averages may result in memory overflows or real-time computational difficulties. Users have the option to load a pre-computed average expression matrix instead of computing the matrix on-the-fly; this obviates the need to ingest the entire scRNAseq data to compute cluster statistics.
In summary, we bridge the gap in gene coverage between scRNA-seq and SRT data by directly projecting average expression based on fine-grained annotations, to provide an approximate visualization of scRNA-seq data in the spatial domain.
Selection of transcriptomically similar cells
GeneSurfer focuses on the gene-centric exploration within a subset of cells with similar transcriptome profiles. The user selects cells of interest in three interactive ways, and further gene-centric exploration is based on the selected subset of cells. To maximize flexibility and applicability, the three selection methods can be used independently, to ensure that users can still explore datasets even when annotations or embeddings are unavailable.
First, if annotation is present in the dataset, the user can select a cell type directly by selecting an annotation label from the data taxonomy.
Second, GeneSurfer is designed to seamlessly integrate with SpaceWalker20 by synchronizing the user-defined locations, where SpaceWalker automatically identifies transcriptomically similar cells at user-defined locations and GeneSurfer simultaneously provides further gene-centric exploration and GO term annotation within these cells. This has the advantage that cell selection is not restricted to prior discrete cluster boundaries, accommodating for a more gradual continuum of transcriptomic types.
Third, GeneSurfer can work with the dimensionality reduction plugin of ManiVault.24 This workflow allows users to select cells from embeddings, such as HSNE,21 t-SNE14 or UMAP,15 computed on-the-fly from the data. In addition, GeneSurfer can directly use externally computed dimensionality reduction results stored in standard formats (e.g., /obsm in.h5ad files), allowing users to interactively select cells from any preferred custom embedding.
Gene filtering
Since the number of genes/features in scRNA-seq data can reach 30k,3 we provide a filtering option before conducting gene clustering to remove uninformative genes. This step helps to selectively filter out genes that show specific spatial expression patterns or are highly expressed within the subset, ensuring a focus on the most informative genes for analysis.
The number of genes to keep for clustering is an interactively modifiable parameter. The user can dynamically observe the filtered genes displayed in the bar chart and a real-time overview of the clustered gene patterns while adjusting the desired number of genes. We provide three gene filtering modes: 1) differential expression between selected cells and the whole dataset, 2) Moran’s I, and 3) correlation of gene expression with spatial coordinate axes.
Differential: This option filters out genes that exhibit high expression among the selected cells, typically indicating cell type specific expression. Genes are sorted based on the difference in average expression levels between the selected cells and the whole dataset.
Moran’s I: Moran’s I22 measures spatial auto-correlation to identify genes with non-random expression patterns within the cell subset. Euclidean distances of the spatial coordinates are calculated to create a spatial weight matrix w. GeneSurfer then calculates Moran’s I for each gene using the weight matrix and gene expression within the cell subset as:
The calculated value is compared against the expected value under the null hypothesis:
Z score is calculated for each gene to indicate the statistical significance of the results. Genes are sorted by their Z-scores.
While statistical significance (Z > 1.96 or < −1.96, corresponding to p < 0.05) serves as a useful guideline, the interactive slider allows users to retain any desired number of genes to support flexible exploratory analysis.
Spatial: This option is for uncovering significant expression correlations along specific spatial axes in the CCF, such as the dorsal-ventral or anterior-posterior axis. It calculates the Pearson correlation coefficient of each gene’s expression values with the spatial coordinates within the cell subset. Genes with high correlation scores are likely to exhibit strong expression patterns along the specified axes in the spatial domain.
Pre-alignment to a standard anatomical atlas (such as the Allen Brain CCF) is ideal but not mandatory. For single-slice spatial datasets, meaningful spatial gradients can still be identified using image-derived coordinate axes (e.g., dorsal-ventral or anterior-posterior). Similarly, multi-slice datasets without CCF alignment can use the within-slice axis to identify gene expression gradients along the within-slice axis.
Gene clustering
After filtering genes, hierarchical clustering23 is applied to group these filtered genes based on their local expression within the cell subset. The clustering algorithm uses a distance metric defined as 1 minus Pearson correlation score between the local expression of each gene pair. Subsequently, an average expression view of each cluster is displayed on the spatial map, providing a visual overview of gene expression patterns within each cluster. Meanwhile, filtered genes are color-coded in the bar chart according to their clustering labels.
Determining the optimal number of clusters has always been a challenge with clustering algorithms, especially in the context of data that is dynamically updated during exploration with our tool. Too many clusters can lead to overclustering, reducing the distinctiveness of gene expression patterns across clusters. Conversely, too few clusters might group genes with distinct patterns into one single cluster. To address this issue, we enable users to adjust the number of clusters on-the-fly, allowing for the determination of an appropriate cluster number through direct, iterative inspection for different data.
Gene set functional enrichment analysis
Upon selecting a gene cluster, gene set functional enrichment analysis is performed on-the-fly using the gProfiler10 or ToppGene9 API, and the results are dynamically updated with each cluster selection. The gene symbols of the selected gene cluster are sent to the chosen API and the enrichment analysis results are sent back to our system. The statistically significant GO terms, p-values and associated gene symbols are displayed in a table linked to the bar chart. Clicking on a GO term of interest highlights the associated genes in the bar chart, allowing for follow-up inspection of individual genes in 2D and 3D visualizations.
Users have the flexibility to switch between these APIs based on their preference. The integration of gProfiler API allows users to specify the organism and supports a background gene set. For analysis of the limited SRT gene set, in addition to sending the genes to be analyzed, all genes measured in the SRT dataset are sent as background genes (500 genes for ABC Atlas). The ToppGene API integration does not support specifying organisms or providing custom backgrounds due to API constraints. Thus, the ToppGene enrichment option is enabled only with single-cell projections. Despite these limitations, it remains available as a complementary enrichment alternative. For whole-transcriptome scRNA-seq data (>30k genes for ABC Atlas), no background list is used.
While GeneSurfer offers integrated on-the-fly GO term retrieval, we emphasize that it is important to complement the computational analyses and generated hypotheses with wet-lab validation.
Closed-loop interaction and parameter selection
A key contribution of GeneSurfer, distinguishing it from existing scripted transcriptomics data analysis methods, is its highly interactive interface designed specifically for exploratory analysis (see Video S1). The interface includes adjustable toolbars that allow users to tailor their data exploration. It offers direct, closed-loop visual feedback, showing users the effect of changes to hyperparameters.
Figure S2A shows a screenshot of the software interface. The interactive exploration loop starts with the selection of transcriptomically similar cells, as detailed in the selection of transcriptomically similar cells section. Changes in the cell selection immediately trigger gene filtering and gene clustering based on user-defined settings, with options and adjustments detailed in the gene filtering and gene clustering section. Adjustments are made in real time using user interface controls, such as sliders and checkboxes, allowing users to see immediate updates in the bar chart of filtered genes and the clustering visualization of average gene expression. Clicking on a cluster triggers retrieval of the associated GO terms for genes in the selected cluster. The integration of cell selection, gene filtering and clustering, and visualization enables users to perform their analysis through a flexible, interactive interface.
One of our primary goals is to provide visualization of gene expression in the spatial domain, allowing users to inspect individual genes at each step of the exploration workflow. First, the user can visualize a specific gene in the dataset in 3D by entering the gene symbol in the search box (Figure S2B1). Second, selecting a GO term of interest from the annotation table highlights the associated genes in the bar chart (Figure S2B1). Third, hovering the mouse over the bars displays the gene symbol and highlights genes with the same cluster label. When the user clicks on a bar, the expression level of that gene is plotted in the spatial domain, enabling the user to select a gene of interest from the bar chart (Figure S2B2). This feature not only facilitates the exploration of genes associated with specific biological functions or pathways, but also allows for the inspection of individual gene expression in relation to selected GO terms.
In addition, point opacity for non-selected cells is adjustable from 0 to 1, providing flexibility in visualizing local expression. Setting the opacity to 0 isolates and displays only the selected cells, offering a focused view of the region of interest. Conversely, an opacity value of 1 renders all cells visible, providing a full overview of the spatial map. Intermediate opacity values allow for visual differentiation between the cells of interest and the surrounding cells, improving analytical clarity. Furthermore, the user can adjust the point size of the cells to adapt to the varying cell densities present in spatial datasets, enabling an optimal visualization across different scales of datasets.
Scalability and general applicability
GeneSurfer is designed for exploring datasets across varying scales. It supports both 2D and 3D datasets, offering interaction facilities such as cell selection in 2D visualization as well as 3D rendering to visualize expression at a whole-organ scale. While GeneSurfer works directly with SRT data, we extend its functionality to visualize the full transcriptome by projecting the scRNA-seq data onto the spatial domain.
Computational resource requirements depend primarily on the size of the datasets, with memory usage being the most significant factor. To demonstrate scalability and performance, we used the ABC Atlas, one of the largest datasets available with both scRNA-seq and MERFISH data in one atlas. We tested the performance of GeneSurfer on two computers: a workstation with a 6-core, 3.80GHz Intel processor, 128GB of RAM and an NVIDIA Quadro 2200 GPU, and a laptop with a 6-core, 2.60GHz Intel processor, 16GB of RAM and an NVIDIA GTX 1660Ti GPU.
Video S1 was recorded on the workstation, demonstrating that computations and visualizations were updated within milliseconds. This responsiveness offers a fully interactive exploration experience, even when working with large-scale datasets like the ABC Atlas.
On the laptop, we tested loading and interaction with the pre-saved ABC Atlas project. The initial loading time was about 9 min, mainly due to the memory limitation. However, once the dataset was loaded, the interface remained fully interactive. Although some lag was observed during exploration, it continued to function properly, demonstrating its ability to operate on less powerful hardware.
To assess general applicability across different SRT technologies, we additionally tested GeneSurfer on datasets from three other technologies: a developing mouse brain dataset28 (4,628 cells and 117 genes in HybISS data, 40,733 cells and 16,907 genes in scRNA-seq data), a SeqFISH mouse organogenesis dataset29 (17,806 cells and 351 genes) and a Slide-seqV2 mouse embryo dataset31 (24,634 cells and 27,554 genes in dataset E9.5: 201112_04). We showcased the exploration of the SeqFISH mouse organogenesis data in Video S2, and Figure S3 shows the exploration results on the Slide-seqV2 mouse embryo dataset. GeneSurfer performed smoothly across all these datasets, confirming its ability to handle datasets across SRT platforms with varying resolutions. The corresponding system state files, containing the data and GUI configurations are available in our repository.
Video S2. Demonstration of interactive exploration with GeneSurfer using SeqFISH mouse organogenesis data, related to STAR Methods
Quantification and statistical analysis
All analyses and visualizations presented in this manuscript (unless otherwise noted) were generated using our proposed software GeneSurfer, as described in the method details section. Specific parameter settings used in GeneSurfer are stated in the figure legends. Figures 2A, 2B, and S1A were generated using Python. Pearson correlation coefficients for these figures were calculated using SciPy (1.10.1), and plots were generated using Matplotlib (3.7.1).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Yao Z.van Velthoven C.T.J.Kunst M.Zhang M.Mc Millen D.Lee C.Jung W.Goldy J.Abdelhak A.Aitken M.A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain Nature 62420233173323809291610.1038/s 41586-023-06812-z PMC 10719114 · doi ↗ · pubmed ↗
- 2Moses L.Pachter L.Museum of spatial transcriptomics Nat. Methods 19202253454610.1038/s 41592-022-01409-235273392 · doi ↗ · pubmed ↗
- 3Heumos L.Schaar A.C.Lance C.Litinetskaya A.Drost F.Zappia L.Lücken M.D.Strobl D.C.Henao J.Curion F.Best practices for single-cell analysis across modalities Nat. Rev. Genet.2420235505723700240310.1038/s 41576-023-00586-w PMC 10066026 · doi ↗ · pubmed ↗
- 4Lein E.S.Hawrylycz M.J.Ao N.Ayres M.Bensinger A.Bernard A.Boe A.F.Boguski M.S.Brockway K.S.Byrnes E.J.Genome-wide atlas of gene expression in the adult mouse brain Nature 44520071681761715160010.1038/nature 05453 · doi ↗ · pubmed ↗
- 5Saelens W.Cannoodt R.Saeys Y.A comprehensive evaluation of module detection methods for gene expression data Nat. Commun.9201810902954562210.1038/s 41467-018-03424-4PMC 5854612 · doi ↗ · pubmed ↗
- 6Hartigan J.A.Direct clustering of a data matrix J. Am. Stat. Assoc.671972123129
- 7Cheng Y.Church G.M.Biclustering of expression data Proc. Int. Conf. Intell. Syst. Mol. Biol.820009310310977070 · pubmed ↗
- 8Subramanian A.Tamayo P.Mootha V.K.Mukherjee S.Ebert B.L.Gillette M.A.Paulovich A.Pomeroy S.L.Golub T.R.Lander E.S.Mesirov J.P.Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles Proc. Natl. Acad. Sci. USA 102200515545155501619951710.1073/pnas.0506580102 PMC 1239896 · doi ↗ · pubmed ↗
