Analysis of community connectivity in spatial transcriptomics data

Juan Xie; Kyeong Joo Jung; Carter Allen; Yuzhou Chang; Subhadeep Paul; Zihai Li; Qin Ma; Dongjun Chung

PMC · DOI:10.3389/fams.2024.1403901·June 5, 2025

Analysis of community connectivity in spatial transcriptomics data

Juan Xie, Kyeong Joo Jung, Carter Allen, Yuzhou Chang, Subhadeep Paul, Zihai Li, Qin Ma, Dongjun Chung

PDF

Open Access

TL;DR

This paper introduces a new method to analyze how cell communities are connected in spatial transcriptomics data, helping understand cellular interactions in tissues like cancer.

Contribution

The paper introduces ACC and BANYAN, a novel Bayesian model for analyzing community connectivity in spatial transcriptomics.

Findings

01

BANYAN successfully recovers community connectivity structures in simulated and real spatial transcriptomics data.

02

ACC identifies distinct cliques of interacting cell sub-populations in cancer datasets.

03

The method is applied to melanoma brain metastases and lung cancer data, revealing new insights into cellular dynamics.

Abstract

The advent of high throughput spatial transcriptomics (HST) has allowed for unprecedented characterization of spatially distinct cell communities within a tissue sample. While a wide range of computational tools exist for detecting cell communities in HST data, none allow for the characterization of community connectivity, i.e., the relative similarity of cells within and between found communities—an analysis task that can elucidate cellular dynamics in important settings such as the tumor microenvironment. To address this gap, we introduce the analysis of community connectivity (ACC), which facilitates understanding of the relative similarity of cells within and between communities. We develop a Bayesian multi-layer network model called BANYAN for the integration of spatial and gene expression information to achieve ACC. We demonstrate BANYAN’s ability to recover community…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Mus musculus

Diseases4

melanoma invasive ductal carcinoma small-cell lung cancer tumor

Figures5

Click any figure to enlarge with its caption.

A](#F2)). When we selectively decrease SNR by setting $[eqn]$ , BANYAN is still able to perfectly recover sub-population labels. However, when we further decrease SNR by increasing $[eqn]$ , we find that sub-population inference is corrupted. Notably, as the SNR approaches 1, BANYAN merges sub-populations 4 and 5 first, as they feature high between-community connectivity *and* spatial proximity. Alternatively, BANYAN splits sub-population 1 into two distinct communities instead of combining sub-populations 1 and 3, which are spatially disjoint. In Supplementary Figure S2, we visualize this tre

C](#F3) (Supplementary Figure S5), we observe a decreasing density of cell-cell connectivity as we move from outside to within the tumor. This pattern suggests additional cellular heterogeneity within the tumor relative to the surrounding inflammatory and blood tissue components. When consulting the between-community connectivity parameters in [Figure 3D](#F3), we find two distinct pairs of cell sub-populations: (2,3) and (1,2), which feature significantly higher inter-connectivity than all other pairs of sub-populations. These three sub-populations comprise the tumor-external components of th

C](#F4). We find that the estimate of this parameter is near zero [as evidenced by the 95% credible for community pair (1,2) in [Figure 4C](#F4)], supporting our hypothesized dissimilarity between sub-populations 1 and 2. In fact, sub-population 1 featured very low between-community connectivity with all other sub-populations besides sub-population 4 [e.g., significantly higher connectivity was featured between sub-populations pairs (1,4) than (1,2) as shown in Supplementary Figure S7], which occupies a heterogeneous “background” position in the spatial landscape of the tissue sample ([Figure

A](#F5)), investigated the marker genes for each sub-population ([Figure 5B](#F5)), and examined the within- and between-community connectivity ([Figures 5C](#F5), [D](#F5)). The identified four sub-populations closely matched the spatially resolved neighborhood niches reported in the previous literature [[43](#R43)]. Each of BANYAN sub-populations 1 to 4 mainly correspond to tumor cells, myeloid-enriched stroma cells, lymphoid structures, and stroma cells, respectively. The spatial distribution of markers for each sub-population confirmed this correspondence. For instance, *KRT17* is a specif

Keywords

spatial transcriptomicsanalysis of community connectivitystochastic block modelBayesian modelsnetwork analysis

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSingle-cell and spatial transcriptomics · Gene expression and cancer classification · Bioinformatics and Genomic Networks

Full text

Introduction

1

The advent of spatial transcriptomics has allowed for unprecedented characterization of tissue architecture in terms of spatially resolved transcript abundance [1]. In particular, high throughput spatial transcriptomics (HST) technologies, such as the 10× Visium platform, have become popular due to their deeper transcriptome-wide sequencing depth. The proliferation of HST data has led to the development of several computational tools for discerning cell sub-populations in HST data while considering both gene expression and spatial information. The existing tools span a range of methodological categories, including neural networks [2–4], graph clustering algorithms [5, 6], and Bayesian statistical models [7, 8].

These methods are fundamentally limited in that they do not explicitly model the interactive nature of cell sub-populations in a tissue sample [9]. In other words, the sub-populations derived from existing methods are considered static, and no information is provided on how they relate to one another. Meanwhile, it is known that communication within and between groups of cells is a fundamental driver of healthy and diseased processes in complex tissue [10]. Moreover, Canozo et al. [4] report substantial heterogeneity within traditional mouse olfactory bulb layer annotations, driven in part by spatial variation in intercellular communication patterns. However, detecting higher resolution cell sup-populations with existing tools is challenging as there is a lack of model-based methodology for determining which cell suppopulations may be members of a common broader phenotype (e.g., immune or cancer cell sub-types) based on similar yet distinct gene expression or spatial location patterns. As a consequence, current tools cannot be used to study the community connectivity structure of cell sub-populations, i.e., the relative similarity among cells within and between sub-populations.

By studying community connectivity structure, we may obtain valuable insights into the interactive dynamics and spatial heterogeneity of cell sup-populations in challenging settings such as the tumor microenvironment. For example, instead of simply labeling categories of immune cells and cancer cells in a tumor, we may quantify how these important cell sub-populations relate to one another, and how tertiary intermediate sub-populations may be mediating important dynamics within the tumor microenvironment. Furthermore, characterizing community connectivity structure may help inform more biologically informative annotations of ambiguous sub-populations by relating them to more clearly defined sub-populations. Doing so may allow for a more holistic interpretation of all HST cell clusters in the common case when only a few cell clusters correspond clearly to a known cell type.

To address these gaps, we propose BANYAN (Bayesian ANalysis of communitY connectivity in spAtial single-cell Networks): a Bayesian statistical network model capable of discerning community connectivity structure in HST data. BANYAN draws inspiration from the vast field of biological network analysis [11], and is built on the supposition that HST data is most accurately represented as similarity networks that reflect similarity between cell spots in terms of spatial location and transcriptional profiles. As opposed to simple comparisons of marker gene expression across cell sub-populations, quantification of similarity metrics can more effectively represent the information contained in thousands of gene markers. To this end, BANYAN introduces a Bayesian multi-layer stochastic block model [12, 13] that infers a community connectivity structure to characterize the relationships between cell spots both within and between sub-populations, based jointly on transcriptional and spatial similarity between cell spots. We offer convenient implementation and interactive visualization functionality via the R package BANYAN.

Methods

2

BANYAN is the first HST computational tool to allow for analysis of community connectivity (ACC), i.e., the process of inferring the similarity of cell spots within and between sub-populations. A graphical representation is given in Figure 1, and the workflow to achieve ACC can be summarized as follows. First, given cell spot-level gene expression features and spatial coordinate data from HST platforms, we construct two spot-spot nearest neighbor networks. These networks are then integrated into a multi-layer graph data structure. Then, we fit a Bayesian multilayer stochastic block model (MLSBM), which assumes that spatial location and gene expression patterns of cell spots arise from a common community structure. The estimated parameters from this model allow us to infer the community structure of the tissue sample by quantifying the relative similarity between cell spots within and between sub-populations.

Data pre-processing

2.1

To represent the interactive nature of cells and cell types, we adopt two cell-cell similarity networks as our primary data objects: one for gene expression and another for spatial location. To form the cell spot-cell spot gene expression similarity network, we first apply standard pre-processing steps including scaling, removal of technical artifacts, and identification of highly variable genes [14–16]. We then embed each of the $[eqn]$ total cell spots in a lower-dimensional space using principal components analysis (PCA) applied to the top 2,000 most variable genes. To form the cell spot-cell spot gene expression similarity matrix, we represent each cell spot as a node and connect each cell spot to its $[eqn]$ closest neighboring cell spots in the gene expression principal component space using a binary edge. We utilize the same approach to construct the spatial cell spot-cell spot similarity network, where principal components are replaced with 2-dimensional spatial coordinates. The resultant data structure is two networks with $[eqn]$ nodes, each of degree $[eqn]$ . By default, we adopt the widely used heuristic of choosing $[eqn]$ as the closest odd integer to $[eqn]$ [17], which allows the number of neighboring spots to increase as the size of the tissue sample increases. With the typical HST experiment yielding a total number of cell spots between 2,000 and 3,000, this heuristic leads to consideration of between third- and fourth-order neighborhood structures (Supplementary Figure S1). Overall, we view $[eqn]$ as a tuning parameter that may be adjusted depending on the amount of information sharing desired across a tissue sample.

Model

2.2

We develop the core statistical model within BANYAN as an extension of the widely used stochastic block model [18], a flexible generative model for network data that allows for the assessment of community structure based on the frequency of binary edges among and between subsets of nodes. We define $[eqn]$ as the $[eqn]$ binary adjacency matrix encoding the gene expression similarity network, and $[eqn]$ as the binary adjacency matrix encoding the spatial similarity network. The matrix elements $[eqn]$ and $[eqn]$ indicate the presence or absence of a binary un-directed edge between nodes $[eqn]$ and $[eqn]$ for gene expression and spatial information, respectively. We define $[eqn]$ as the multi-layer graph that encodes similarity between cell spots in terms of both gene expression and spatial information. While we focus on the integration of spatial and gene expression information, our proposed framework may be extended to $[eqn]$ layers to incorporate other sources of information from multiplexed experimental assays.

Given the multi-layer graph data $[eqn]$ , we assume that the absence or presence of edges in each layer between each pair of nodes $[eqn]$ and $[eqn]$ follows a Bernoulli distribution with probability of an edge $[eqn]$ , where $[eqn]$ denotes the latent cell spot sub-population assignment for cell spot $[eqn]$ . We refer to such a model as MLSBM. Formally, we assume for $[eqn]$ ,

[eqn]

where $[eqn]$ , and $[eqn]$ is a $[eqn]$ connectivity matrix with diagonal elements $[eqn]$ for $[eqn]$ controlling the probability of an edge occurring between two cell spots in the same sub-population, and off-diagonal elements $[eqn]$ for $[eqn]$ controlling the probability of an edge occurring between two nodes in different cell spot sub-populations. Importantly, Model (1) implies that connections among cell spots in the gene expression and spatial layers are governed by a common set of community structure parameters $[eqn]$ and $[eqn]$ . Note that in the graph $[eqn]$ , the edges corresponding to the cases that $[eqn]$ and $[eqn]$ usually constitute the core of the community, while the edges corresponding to the cases that $[eqn]$ and $[eqn]$ , or $[eqn]$ and $[eqn]$ , usually constitute the outskirts of the community. Given Model (1) and data $[eqn]$ , our primary inferential objective is to characterize the cell spot-cell spot interaction both within and between them by estimating the parameters $[eqn]$ , which we accomplish using a Bayesian approach as described below.

Bayesian inference

2.3

Priors

2.3.1

To achieve a fully Bayesian parameter estimation scheme, we assign prior distributions to all model parameters. We adopt available conjugate priors to obtain closed-form full conditional distributions of all model parameters, allowing for straightforward Gibbs sampling. For the latent cell sub-population indicators $[eqn]$ , we assume a conjugate multinomial-Dirichlet prior with $[eqn]$ for $[eqn]$ , and $[eqn]$ , where $[eqn]$ controls the relative size of each cell sub-population to allow for a heterogeneous distribution of cell type abundances. We adopt a conjugate Beta-Bernoulli prior for $[eqn]$ by assuming $[eqn]$ for $[eqn]$ . As a default, we opt for weakly informative priors by setting $[eqn]$ and $[eqn]$ [19].

Markov chain Monte Carlo (MCMC) algorithm

2.3.2

The model proposed in Sections 2.2 and 2.3.1 allows for closed-form full conditional distributions of all model parameters. Thus, we adopt the following Gibbs sampling algorithm for parameter estimation. In practice, we recommend initializing the indicators $[eqn]$ using a heuristic graph clustering method such as the Louvain algorithm [20] applied to $[eqn]$ to facilitate timely model convergence.

Update $[eqn]$ from its full conditional $[eqn]$ , where $[eqn]$ , and $[eqn]$ is the number of nodes assigned to cell sub-population $[eqn]$ at the current MCMC iteration, i.e., $[eqn]$ .For $[eqn]$ , update $[eqn]$ from

[eqn]

where $[eqn]$ are the number of observed edges between communities $[eqn]$ and $[eqn]$ across both layers, and $[eqn]$ are the number of possible edges between communities $[eqn]$ and $[eqn]$ is the number of nodes assigned to cell sup-population $[eqn]$ , and $[eqn]$ is the indicator function equal to 1 if $[eqn]$ and 0 otherwise.For $[eqn]$ , update $[eqn]$ from $[eqn]$ , where $[eqn]$ and

[eqn]

Label switching

2.3.3

Label switching is a ubiquitous issue faced by models whose likelihood is invariant to permutations of a latent categorical variable such as $[eqn]$ . Consequently, stochastically equivalent permutations of $[eqn]$ may occur over the course of MCMC sampling, causing the estimates of all community-specific parameters to be conflated, thereby jeopardizing the accuracy of model parameter estimates. Previous approaches for addressing label switching rely on re-shuffling posterior samples after completion of the MCMC algorithm [21]. However, such methods rely on prediction and are subject to prediction error. To protect against label switching within the MCMC sampler, we adopt the canonical projection of $[eqn]$ proposed by Peng and Carvalho [22], who restrict updates of $[eqn]$ to the reduced sample space $[eqn]$ , wherein label switching is less likely due to the restricted sample space. In practice, we manually permute $[eqn]$ at each MCMC iteration such that community 1 appears first in $[eqn]$ , community 2 appears second in $[eqn]$ , et cetera. Finally, we estimate $[eqn]$ using the maximum a posteriori (MAP) estimate across all post-burn MCMC samples [19].

Analysis of community connectivity

2.4

Estimation of the MLSBM model parameters $[eqn]$ with the corresponding maximum a posteriori estimates $[eqn]$ allows for inference of community connectivity structure in HST data. While the estimated community labeling vector $[eqn]$ is what we use to define communities, the elements of $[eqn]$ describe how cell spots within and between communities relate to one another, thereby characterizing community connectivity. Specifically, elements $[eqn]$ reflect the estimated probability of a randomly chosen cell spot in community $[eqn]$ sharing a nearest neighbors edge in $[eqn]$ with a cell spot in community $[eqn]$ . When $[eqn]$ reflects the average connectivity within a community, which may be used to assess the relative homogeneity of a community. Heterogeneous communities tend to have lower average within-community connectivity, while more homogeneous communities tend to have higher within-community connectivity. Likewise, when $[eqn]$ represents the probability of connection between cell spots in two distinct communities. This between-community connectivity measurement allows us to discern closely related communities that may contain similar cell types from more distinct communities. Taken together, these between and within-community connectivity parameters capacitate analysis of community connectivity.

Software implementation

2.5

We provide the R package banyan for convenient implementation of the proposed workflow. The banyan package efficiently implements Bayesian estimation using custom Gibbs sampling algorithms implemented in C++ using Rcpp. The core model fitting functions integrate seamlessly with standard Seurat [23] data structures, allowing users to easily incorporate ACC into existing HST analysis workflows. Further, banyan allows users to investigate community connectivity using external sub-population labels, thereby encouraging widespread utility of ACC. As clustering algorithms for HST proliferate—each with different assumptions and optimal use cases, utilizing BANYAN for post-hoc community connectivity analysis will help elucidate tissue heterogeneity across the widest possible range of application settings. We developed both interactive and static visualization functions for interrogation of BANYAN sub-population labels and community connectivity structure. The banyan package interfaces seamlessly with standard Seurat workflows, and is freely available at https://github.com/dongjunchung/banyan.

Result

3

Simulation studies show BANYAN effectively identifies underlying community connectivity structures under a broad range of signal-to-noise ratio settings

3.1

We designed a simulation study to validate the performance of the MLSBM employed by BANYAN. We adopted a publicly available sagittal mouse brain data set [24] sequenced with the 10× Visium platform. In our simulation data, we manually allocated the $[eqn]$ total cell spots in the original sagital mouse brain data set into five spatially contiguous mouse brain layers. Assuming that a community structure is given as

[eqn]

we defined the signal-to-noise ratio (SNR) of the simulated gene expression network as $[eqn]$ , i.e., the ratio of the within- to between-community connectivity. SNR values much greater than 1 give rise to a strong community structure in the simulated data, while SNR values close to 1 result in a weaker community structure. We do not consider values of SNR below 1, as the resultant disassortative community structure is not reflective of cell type structure in HST data. In addition, we note that SNR was close to 10 in all the real data applications we consider in the following sections, i.e., the ranges of SNR we considered in these simulation studies are significantly lower than those we observe in real datasets. Hence, given the usual SNR levels we observe in real datasets, BANYAN is expected to effectively recover the underlying community connectivity structure.

We explore the effects of varying between-community connectivity and within-community connectivity to modulate the signals-to-noise ratio (SNR). We used the general community structure given by

[eqn]

where by default $[eqn]$ for $[eqn]$ and $[eqn]$ otherwise. In Figure 2 we visualize results from three different variations of Equation (2) described above. In the first setting, we varied the between-community connectivity between pair of spatially disjoint communities 1 and 3 and the pair of bordering communities 4 and 5 to analyze the interplay between between-community connectivity and spatial co-localization (Figure 2A). When we selectively decrease SNR by setting $[eqn]$ , BANYAN is still able to perfectly recover sub-population labels. However, when we further decrease SNR by increasing $[eqn]$ , we find that sub-population inference is corrupted. Notably, as the SNR approaches 1, BANYAN merges sub-populations 4 and 5 first, as they feature high between-community connectivity and spatial proximity. Alternatively, BANYAN splits sub-population 1 into two distinct communities instead of combining sub-populations 1 and 3, which are spatially disjoint. In Supplementary Figure S2, we visualize this trend across a finer grid of $[eqn]$ and $[eqn]$ . In Figure 2B, we demonstrate this phenomenon from different perspective, in which decreasing the clustering resolution from $[eqn]$ to $[eqn]$ features merging of the two distinct sub-population 1 components due to their spatial proximity and high connectivity. Finally, in setting 3 (Figure 2C), we investigated the effect of selectively decreasing within-community connectivity parameters for sub-populations 1, 2, and 3. We find that at low SNR settings of $[eqn]$ , BANYAN is unable to properly allocate cell spot labels, while increasing the SNR by increasing $[eqn]$ results in correct recovery of ground truth labels. In Supplementary Figure S3, we provide results from across a finer grid of $[eqn]$ , and $[eqn]$ . The results from settings 1–3 in Figure 2 highlight a characteristic of the spatially-aware MLSBM, namely that the model places a preference on merging spatially neighboring communities instead of spatially separate communities when the SNRs for each pair are equally low (i.e., approaches SNR = 1).

Identifying cellular interplay in human melanoma brain metastases

3.2

Brain metastases are a common cancer complication, arising most often from lung cancer, breast cancer, and melanoma and occurring in nearly 30% of patients with solid tumors [25]. In the United States, an estimated 98,000 to 170,000 patients are diagnosed with brain metastases each year, and the incidence is increasing [26]. Due to the fact that conventional therapies can rarely cure brain metastases, researchers have been seeking alternative treatment options, and immunotherapy is one promising candidate [27]. In recent years, many scientific efforts have been devoted to investigating the interaction between the immune system and the tumor microenvironment (TME) of brain metastases, shedding light on the immune biology of brain metastases. For instance, Sudmeier et al. [28] reported that human brain metastases are well infiltrated by $[eqn]$ T cells.

To better understand the spatial distribution of immune cells in brain metastases TME and their interactive relationship with tumor cells, we applied BANYAN to the human melanoma brain metastasis sample from Sudmeier et al. [28], who applied spatial transcriptomics and identified distinct tumor, inflammatory, and blood cell sub-populations. Using BANYAN, we identified four spatially distinct spot sub-populations (Figure 3A), characterized the function of each sub-population in the TME using known marker genes (Figure 3B; Supplementary Figure S4), and studied the similarity structure among cell spots within and between sub-populations (Figures 3C, D). The identified sub-populations from BANYAN closely resemble the TME regions reported by Sudmeier et al. [28]. BANYAN sub-population 1 corresponds to blood cells, sub-population 2 to inflammatory immune cells, sub-population 3 to tumor-inflammatory adjacent cells, and sub-population 4 to the tumor region, respectively. Functional annotation based on the expression of marker genes (Figure 3B; Supplementary Figure S4) further suggests sub-population 1 consists of naive cells, while both sub-populations 2 and 3 are $[eqn]$ T cells, and the expression of $[eqn]$ T cell markers are higher in sub-population 3 compared to those in sub-population 2.

We then utilized BANYAN to implement ACC and characterize the interplay among the four identified sub-populations—a unique functionality not offered by other HST analysis tools. When investigating within-community connectivity parameters in Figure 3C (Supplementary Figure S5), we observe a decreasing density of cell-cell connectivity as we move from outside to within the tumor. This pattern suggests additional cellular heterogeneity within the tumor relative to the surrounding inflammatory and blood tissue components. When consulting the between-community connectivity parameters in Figure 3D, we find two distinct pairs of cell sub-populations: (2,3) and (1,2), which feature significantly higher inter-connectivity than all other pairs of sub-populations. These three sub-populations comprise the tumor-external components of the tissue sample and reflect a relatively high degree of inter-connectivity between blood, immune, and tumor-adjacent sub-populations. In comparison, the tumor region (sub-population 4) featured significantly lower inter-connectivity with the rest of the tissue sample, as evidenced by the significantly lower between-connectivity parameter estimates for the pairs of (3,4), (1,4), and (2,4) in Figure 3D.

To better understand connectivity, we further analyzed the paired single-cell T cell receptor (TCR) sequencing data in terms of repertoire overlap and diversity (Supplementary Section S1). It turned out that the pairs with higher BCC values correspond to those with higher TCR similarity (Figure 3E). Furthermore, sub-populations with higher WCC values (e.g., sub-populations 1 and 2) exhibited lower repertoire diversity compared to the ones with lower WCC values (Figure 3F). The above relationships between WCC/BCC and TCR repertoire indicate that ACC holds the potential to uncover cellular dynamics under the setting of TME.

Discovering community structure in invasive ductal carcinoma

3.3

Accounting for roughly 25% of all non-dermal cancers in women, breast cancer ranks as the most common non-dermal female-specific cancer type, and narrowly the most common cancer type across both sexes [29]. Of all sub-types, invasive ductal carcinoma (IDC) is the most common and most severe, accounting for roughly 80% of all breast cancers in women [30]. While previous authors have used spatial transcriptomics to study IDC samples relative to ductal carcinoma in situ (DCIS) samples [31], IDC has yet to be studied through the lens of community structure due to the lack of computational tools available for performing ACC with HST data.

To illustrate ACC in the tumor microenvironment, we applied BANYAN to a publicly available IDC sample sequenced with the 10× Visium platform [32]. We identified five spatially distinct cell spot sub-populations (Figure 4A), and then identified community structure by computing posterior estimates of within and between-community connectivity parameters, as displayed in Figures 4B, C, respectively. Finally, to interpret each sub-population in terms of IDC biology, we computed the most differentially expressed genes between each sub-population and all others using the Wilcoxon rank-sum test implemented in the Seurat package (details in Supplementary material) (Figure 4D). Figure 4D displays a clear block structure in the expression of sub-population marker genes, indicating a strong community structure signal in the data. When considered together with their spatial distribution (Supplementary Figure S6), these marker genes can be used to obtain many interesting biological insights regarding the community structure of the IDC sample. For instance, the S100A11 gene, a marker for sub-population 1, is a diagnostic marker in breast cancers [33] and has been implicated in aggressive tumor progression [34]. Further, KRT8 is used to differentiate aggressive grades of IDCs [35]. While outside of the context of IDCs, DEGS2 has been shown to play a role in the invasion and metastasis of colorectal cancer [36]. Taken together, these marker genes suggest that sub-population 1 contains a relatively high abundance of aggressive and invasive cancer cell types. On the other hand, sub-population 2 featured marker genes such as MALAT1 that are associated with tumor suppressive behaviors in IDCs [37]. Another marker gene for sub-population 2, CCDC80, has been linked with tumor suppressive functions, albeit not in the context of IDCs [38].

Given these brief characterizations of sub-populations 1 and 2 available from the existing literature, we may hypothesize that these groups of cell spots are in some sense opposed in terms of their role within the tumor based on their transcriptional profiles. Indeed, these sub-populations also reside spatially at opposite ends of the tumor slice. We may investigate the similarity or dissimilarity of these sub-populations 1 and 2 using the between-community connectivity parameters presented in Figure 4C. We find that the estimate of this parameter is near zero [as evidenced by the 95% credible for community pair (1,2) in Figure 4C], supporting our hypothesized dissimilarity between sub-populations 1 and 2. In fact, sub-population 1 featured very low between-community connectivity with all other sub-populations besides sub-population 4 [e.g., significantly higher connectivity was featured between sub-populations pairs (1,4) than (1,2) as shown in Supplementary Figure S7], which occupies a heterogeneous “background” position in the spatial landscape of the tissue sample (Figure 4A) and therefore featured relatively high connectivity with all other communities. This spatial heterogeneity is accompanied by relatively low within-community connectivity (Figure 4B), which indicates that spot-spot similarities are less common between cell spots in sub-population 4 than in other sub-populations. In Figure 4D, it can be seen that many of the marker genes for sub-population 2 are shared by sub-population 4, including MALAT1, suggesting a similarity between these two sub-populations in terms of transcriptional profiles. In addition to the marker genes shared with sub-population 2, sub-population 4 features several of its own distinct marker genes, namely the immunoglobulin heavy chain-encoding RNAs IGHG1 and IGHG3. These genes themselves have been shown to feature tumor suppressive tendencies via promotion of B cell-specific immunoglobulin [39], and have been associated with increased patient survival [40]. This observation of functional similarity between sub-populations 2 and 4 is validated by Figure 4C, which clearly shows the highest estimated between-community connectivity in the data occurring between sub-populations 2 and 4. Taken together, these observations may lead us to reason that the sub-population 1 vs. 2 dynamic described previously is linked via the more heterogeneous yet still tumor suppressive-like sub-population 4. While these observations would require further experimental validation to confirm, they showcase the unique ability of BANYAN to describe community structure in the data.

Discovering spatial niches in human non-small cell lung cancer tissues

3.4

Next, we applied BANYAN to public data from the NanoString CosMx Spatial Molecular Imager (SMI) platform [41]. The original dataset consists of measurements of eight samples from five non-small-cell lung cancer (NSCLC) formalin-fixed paraffin-embedded (FFPE) tissues, with a total of 800,327 cells. Here we used one of eight samples (lung 5, replicate 1) for the illustration purpose. We further extracted cells corresponding to basal, macrophage, $[eqn]$ T, and $[eqn]$ T cells types based on the annotation using the Azimuth Healthy Human Lung reference [14, 42]. Finally, we randomly downsampled the original data to 8,000 cells to aid in readability and allow for more rapid MCMC convergence.

Using BANYAN, we identified four sub-populations (Figure 5A), investigated the marker genes for each sub-population (Figure 5B), and examined the within- and between-community connectivity (Figures 5C, D). The identified four sub-populations closely matched the spatially resolved neighborhood niches reported in the previous literature [43]. Each of BANYAN sub-populations 1 to 4 mainly correspond to tumor cells, myeloid-enriched stroma cells, lymphoid structures, and stroma cells, respectively. The spatial distribution of markers for each sub-population confirmed this correspondence. For instance, KRT17 is a specific marker for basal cells and commonly used diagnostic marker for tumors [44], and it is highly expressed in sub-population 1. C1QA, C1QB, and C1QC are markers for macrophages [45] (macrophages are myeloid lineage cells [46]), and they are significantly highly expressed in sub-population 2. IL7R is a lymphoid-associated gene and we observed its over-expression in sub-population 3.

While examining these sub-populations, we found that the sub-population 4 (stroma cells) is more heterogeneous than the other sub-populations. It consists of all four cell types and each cell type constitutes a fair proportion of the sub-population, while other sub-populations are mostly dominated by only one cell type. This heterogeneity differences among sub-populations may further explain the within-community connectivity: sub-population 3 is dominated by $[eqn]$ and $[eqn]$ T cells (proportion around 90%) and has the highest within-community connectivity. Likewise, nearly 81% of sub-population 2 are macrophages, potentially explaining its second-highest within-community connectivity. For sub-population 1, 99% of the cells are tumor cells, which themselves display high heterogeneity. In the case of between-community connectivity, as before, the observations may be mainly explained by spatial adjacency: sub-population 4 neighbors with all the other sub-populations and the connectivity involved in this sub-population are high [e.g., the (2,4), (3,4) and (1,4) pairs].

Discussion

4

We have proposed BANYAN: a network-based statistical framework for the analysis of community connectivity in HST data. In our simulation study, we validate BANYAN’s ability to recover the community connectivity structure, even in the case of relatively low SNR, by considering both gene expression similarity and spatial proximity. We applied BANYAN to human melanoma brain metastases, human breast cancer, and human lung cancer, to illustrate its utility in applied settings. In the human melanoma brain metastasis case study, within-community connectivity parameters indicated increasing within-community heterogeneity as we move from outside to within the tumor. In addition, between-community connectivity parameters indicated a higher degree of inter-connectivity between blood, immune, and tumor-adjacent subpopulations, compared to those associated with the tumor region. Besides, we found interesting relationships between community WCC/BCC and TCR repertoire, which indicates that ACC holds the potential to uncover cellular dynamics under the setting of TME. In the breast cancer case study, we found a strong community structure, with sub-populations marked by both invasive cancer and cancer-suppressive marker genes. Using community structure parameters, we also identified an intermediate sub-population between these two. In the human-small-cell lung cancer case study, we observe the relevance of within-community connectivity with the heterogeneity of each cell spot cluster, as illustrated with the stroma cell sub-population.

There are several ways our work may be extended. First, often the SBM is refined to accommodate heterogeneous degree distributions among nodes, i.e., degree correction [47]. By making this methodological extension to the MLSBM at the core of BANYAN, one could relax our assumption that each cell spot features the same number of neighbors and thereby allow for certain cells spots to feature more connections to the rest of the tissue than other cell spots, such as those on the periphery of the tissue sample. Learning the degree of each cell spot would then inform the detection of highly connected “hub” regions, or weakly connected “satellite” regions of a tissue sample. Second, it is possible to allow gene expression and spatial information to be weighed in a data-adaptive manner, although tuning of appropriate weights would be necessary. Third, another extension could be to relax the assumption that gene expression and spatial location layers are governed by common community structure parameters, and instead allow for layer-specific interpretations of community structure. Fourth, the inherent complexity of network data structures leads to a heavy computational burden for large HST experiments. While we implement our proposed MCMC sampling algorithm using efficient Rcpp routines, BANYAN still requires significantly more computational time than non-network statistical methods [7, 8]. Further optimization would help to reduce the computational burden of community connectivity analysis. Finally, while BANYAN provides the first statistical framework for quantifying community connectivity structure in HST data, further extensions could be made to link BANYAN with methods for predicting cell-cell interactions using data such as ligand-receptor pair status of cells. By doing so, one could refine the general notion of cell spot connectivity to cell spot interaction, which is of major interest in HST data analysis. In this sense, BANYAN establishes a promising statistical framework that may be extended to a wide range of analyses focused on investigating the interactive nature of HST data.

Supplementary Material

Data Sheet 1

Bibliography47

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Asp M, Bergenstrahle J, Lundeberg J. Spatially resolved transcriptomes’ next generation tools for tissue exploration. Bio Essays. (2020) 42:1900221. doi: 10.1002/bies.20190022132363691 · doi ↗ · pubmed ↗
2Chang Y, He F, Wang J, Chen S, Li J, Liu J, Define and visualize pathological architectures of human tissues from spatially resolved transcriptomics using deep learning. Comput Struct Biotechnol J. (2022) 20:4600–17. doi: 10.1016/j.csbj.2022.08.02936090815 PMC 9440291 · doi ↗ · pubmed ↗
3Hu J, Li X, Coleman K, Schroeder A, Ma N, Irwin DJ, Spa GCN: integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat Methods. (2021) 18:1342–51. doi: 10.1038/s 41592-021-01255-834711970 · doi ↗ · pubmed ↗
4Canozo FJG, Zuo Z, Martin JF, Samee MAH. Cell-type modeling in spatial transcriptomics data elucidates spatially variable colocalization and communication between cell-types in mouse brain. Cell Syst. (2022) 13:58–70. doi: 10.1016/j.cels.2021.09.00434626538 PMC 8776574 · doi ↗ · pubmed ↗
5Dries R, Zhu Q, Dong R, Eng CHL Li H, Liu K, Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome Biol. (2021) 22:1–31. doi: 10.1186/s 13059-021-02286-233685491 PMC 7938609 · doi ↗ · pubmed ↗
6Pham D, Tan X, Balderson B, Xu J, Grice LF, Yoon S, Robust mapping of spatiotemporal trajectories and cell-cell interactions in healthy and diseased tissues. Nat Commun. (2023) 14:7739. doi: 10.1038/s 41467-023-43120-638007580 PMC 10676408 · doi ↗ · pubmed ↗
7Zhao E, Stone MR, Ren X, Pulliam T, Nghiem P, Bielas JH, Spatial transcriptomics at subspot resolution with Bayes Space. Nat Biotechnol. (2021) 39:1375–84. doi: 10.1038/s 41587-021-00935-234083791 PMC 8763026 · doi ↗ · pubmed ↗
8Allen C, Chang Y, Neelon B, Chang W, Kim HJ, Li Z, A Bayesian multivariate mixture model for spatial transcriptomics data. Biometrics. (2023) 79:1775–87. doi: 10.1111/biom.1372735895854 PMC 10134739 · doi ↗ · pubmed ↗