MCSPACE: inferring microbiome spatiotemporal dynamics from high-throughput co-localization data

Gurdip Uppal; Guillaume Urtecho; Miles Richardson; Isin Y. Comba; Jeongchan Lee; Thomas Moody; Harris H. Wang; Georg K. Gerber

PMC · DOI:10.1186/s40168-025-02279-4·December 12, 2025

MCSPACE: inferring microbiome spatiotemporal dynamics from high-throughput co-localization data

Gurdip Uppal, Guillaume Urtecho, Miles Richardson, Isin Y. Comba, Jeongchan Lee, Thomas Moody, Harris H. Wang, Georg K. Gerber

PDF

Open Access

TL;DR

MCSPACE is a new AI tool that helps understand how gut microbes are arranged in space and time, using high-throughput data to track changes in their locations and responses to diet.

Contribution

MCSPACE introduces a probabilistic AI method for inferring spatially coherent microbial assemblages and their dynamics from co-localization data.

Findings

01

MCSPACE outperformed existing methods in benchmarking on longitudinal microbiome co-localization datasets.

02

The method identified temporally persistent and dynamic microbial assemblages in the human gut.

03

Dietary perturbations in mice induced shifts in gut microbial assemblages, revealed by MCSPACE analysis.

Abstract

Recent advances in high-throughput approaches for estimating co-localization of microbes, such as SAMPL-seq, allow characterization of the biogeography of the gut microbiome longitudinally and at an unprecedented scale. However, these high-dimensional data are complex and have unique noise properties. To address these challenges, we developed MCSPACE, a probabilistic AI method that infers, from microbiome co-localization data, spatially coherent assemblages of taxa, their dynamics over time, and their responses to perturbations. To evaluate MCSPACE’s capabilities, we generated the largest longitudinal microbiome co-localization dataset to date, profiling spatial relationships of microbes in the guts of mice subjected to serial dietary perturbations over 76 days. Analyses of these data and two existing human longitudinal datasets demonstrated superior benchmarking performance of MCSPACE…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species2

Mus musculus Homo sapiens(human · species)

Figures5

Click any figure to enlarge with its caption.

MCSPACE is a computational tool for inferring spatiotemporal dynamics of microbiomes from high-throughput sequencing-based co-localization data. This schematic depicts the flow from the physical input sample to the outputs produced by MCSPACE, with thick red arrows depicting inputs and outputs of the software tool. A A small portion of a larger fecal or tissue sample to be analyzed is shown. B High-throughput co-localization data, such as SAMPL-seq, are generated by fragmenting the input sample into small (e.g., ~ 40 µm) particles, with several representative particles shown. C DNA in particle

Compendium of SAMPL-seq longitudinal datasets including a new murine dataset with multiple dietary perturbations. A-H Reanalysis of data from a study of gut microbiome spatial co-localizations in a healthy human participant. A–D Daily collections from the participant,* n* = 5 fecal samples total. E–H Collections from the participant over 12 days, with inulin supplementation given on days 5–8 (*n* = 7 fecal samples total). I–L Longitudinal murine gut microbiome spatial co-localization study (*n* = 21 fecal samples total, collected from 3 mice). Dietary perturbations were selected to cause signi

MCSPACE accurately recovered spatial associations in semi-synthetic data and significantly outperformed comparator methods in predicting held-out real data. MCSPACE and five existing methods were evaluated: Fisher’s exact and SIM9, which detect pairwise interactions, and three types of Gaussian mixture models (GMMs), which infer multiway assemblages. A–F Semi-synthetic data were simulated from the MCSPACE model inferred from real data. Methods were assessed for their ability to recover ground-truth information in three tasks: A, B Detecting co-associated pairs of microbes, assessed with area u

MCSPACE analysis of longitudinal high-throughput co-localization data of the human gut microbiome revealed both dynamic and persistent spatial associations. Assemblages with ≥ threefold abundance increase at one time point compared with all other time points were defined as variable; assemblages with the variance of their log proportions over time < 1% were defined as persistent. OTUs that are present in either variable or persistent assemblages with > 5% frequencies are shown. Of note, variable assemblages primarily contained genera or species known to be capable of degrading resistant starch

MCSPACE analysis revealed nutrient-induced changes in spatial associations in the murine gut microbiome in our study with controlled dietary perturbations. Taxa to anchor differential spatial association mapping analyses were selected as those with a ≥ 5% relative abundance on ≥ 3 dietary intervals, yielding two taxa: *Akkermansia* OTU2 and *Lactobacillus* OTU1. Maps depict associations on S1 (standard chow), HF (high fat), HFHF (high fat, high fiber), and LP (low protein) diets. Node sizes indicate relative abundances of taxa, and edge widths indicate the strengths of spatial associations (<

Funding5

—https://doi.org/10.13039/100000011Howard Hughes Medical Institute
—National Science Foundation
—NIH NIGMS
—https://doi.org/10.13039/100011495Massachusetts Life Sciences Center
—Brigham and Women’s Hospital

Keywords

Generative AIBiogeographyComputationalLongitudinalMachine learningMicrobiomeSpatialSpatiotemporalTime-series

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMicrobial Community Ecology and Physiology · Gut microbiota and health · Vibrio bacteria research studies

Full text

Introduction

The mammalian gut harbors a highly complex and dynamic microbial ecosystem that modulates host physiology and is important for maintaining health [1–7]. Trillions of microbial cells, comprised of hundreds of diverse taxa colonizing throughout the gut, co-exist in a highly heterogeneous environment. Spatial features of this environment, including regional differences in nutrient availabilities, cross-feeding interactions [8], and attachment sites [9], strongly influence where particular taxa colonize. These features have functional implications, with spatial organization of microbes in the gut facilitating maintenance of biodiversity [10], microbe-microbe [11], and host-microbe [12] interactions, as well as the stability and plasticity of the gut ecosystem [13–15]. In addition, the microbiome is temporally dynamic, particularly in the context of environmental perturbations [9, 15, 16], such as changes in the host diet. Simultaneous spatial and temporal characterization of the gut microbiome is thus important for fully understanding this ecosystem, with potential to provide new insights into key questions including the kinetics of microbial community assembly [11, 17] and the development and maintenance of robust and stable microbial interdependencies [7, 13, 14, 18], which impact host physiology and disease pathophysiology [15, 16, 18].

Although spatiotemporal characterization of the microbiome has the potential to provide many exciting new insights, technical and practical challenges have historically limited the throughput of spatially resolved methods. Initial fluorescence in situ hybridization (FISH)-based methods were limited by spectral diversity and could only target a small number of taxa at a time [19–21]. Advancements in FISH probe multiplexing technologies have enabled the simultaneous detection of more taxa [19, 22, 23], and interesting recent approaches combining metabolic labeling with FISH [24] can provide multi-modal spatially resolved data. However, the throughput of these methods remains limited, because individual probes for each target microbe [22, 23] must be developed with considerable experimental optimization required [25, 26]. Sequencing-based approaches using laser microdissection to characterize the microbiome in regions of the gut [27, 28] do not require the design of individual probes, and can thus assess the full array of microbes present, but are limited in spatial resolution and throughput due to the microdissection process. Spatial transcriptomics technologies offer another promising approach for studying microbiome biogeography [12, 29], but are at present relatively low throughput and very costly.

Recently introduced sequencing-based technologies for elucidating spatial co-localization of microbes, such as MaPS-seq [30] and its successor SAMPL-seq [31] that further improved scalability using combinatorial split-and-pool barcoding [32–34], overcame many of the above-mentioned throughput challenges and are thus attractive methods for spatiotemporal profiling of the microbiome at ecosystem-scale. These methods can identify spatial co-localizations among hundreds of microbes without requiring prior specification of target taxa. Additionally, these methods have shown strong correlations between spatial associations in stool and gut tissues (i.e., co-localizations of microbes in gut tissue are recapitulated in fecal samples), which enables efficient and cost-effective sampling for studies of changes in the spatial co-localizations of the gut microbiome in the same individual over time. Protocols for these methods involve fixing samples in an acrylamide matrix, cryo-fracturing into particles, and passing particles through sizing filters, typically ~ 40 microns in diameter. A bar-coding procedure is then performed, in which DNA amplicons from microbes in the same particle receive identical sequencing bar codes. Finally, amplicons are released from the particles and sequencing is performed on standard instruments, with bioinformatic processing producing a table of counts per particle for each taxa identified in the sample (Fig. 1A-C). Computational tools are essential for analysis of these data, given their high-throughput nature and particular characteristics, including high-dimensionality, counts-based measurements, uneven amplification resulting in extremely variable numbers of reads per particle, and contamination from unencapsulated DNA mixing into particles [30]. However, existing tools have significant limitations. For instance, the most common approach in use binarizes the data, which discards potentially valuable quantitative information, and additionally restricts analyses to pairwise associations [11]. Clustering models have also been applied to these data [35], which can detect multi-way interactions, but this approach does not address other important properties of these data, including its unique noise characteristics. Additionally, none of the existing methods models temporal changes or perturbations introduced to the ecosystems under study.Fig. 1MCSPACE is a computational tool for inferring spatiotemporal dynamics of microbiomes from high-throughput sequencing-based co-localization data. This schematic depicts the flow from the physical input sample to the outputs produced by MCSPACE, with thick red arrows depicting inputs and outputs of the software tool. A A small portion of a larger fecal or tissue sample to be analyzed is shown. B High-throughput co-localization data, such as SAMPL-seq, are generated by fragmenting the input sample into small (e.g., ~ 40 µm) particles, with several representative particles shown. C DNA in particles is barcoded and then sequenced in bulk to produce a table with counts of operational taxonomic units (OTUs) in each particle. MCSPACE uses a purpose-built generative AI model to infer, from the input table of counts, microbial assemblages, or groups of recurrently spatially co-localizing microbes, and changes in abundance of the assemblages over time and due to experimentally introduced perturbations. D Proportions of assemblages in each sample are modeled as potentially changing over time and with perturbations. Redundant assemblages or irrelevant changes with perturbations are automatically eliminated during inference. E Observed sequencing reads for a particle are generated from the model by first picking an assemblage based on the assemblage proportions for the sample. Then, for each read in the particle, the model chooses to generate it either from the assemblage or from a contamination cluster that accounts for unencapsulated DNA mixing into particles. Assemblages and the contamination cluster are modeled as frequencies of OTUs. MCSPACE outputs three types of visualizations of the inferred model: F assemblage OTU frequency heatmaps organized by phylogeny, G assemblage proportion heatmaps across time and depicting perturbation influences, and H association network plots that depict changes in abundance and associations for anchoring taxa of interest

To enable characterization of microbiome spatiotemporal dynamics at the ecosystem scale, we developed a purpose-built computational tool, MCSPACE, for analyzing sequencing-based microbial co-localization data. MCSPACE, which we make available to the community as an open-source software package, introduces capabilities that address limitations of prior tools, namely: (1) characterization of spatially co-occurring groups of microbes, termed assemblages, to capture multi-way associations; (2) a custom noise model tailored for high-throughput spatial co-localization data; and (3) automatic determination of changes in assemblage abundances over time and due to introduced perturbations. To evaluate our method, we generated the most comprehensive time-series of microbiome spatial co-localization to date, consisting of SAMPL-seq analyses of 21 fecal samples from a cohort of three mice subjected to a series of dietary perturbations conducted over a 76-day period. Using this dataset and two existing longitudinal SAMPL-seq datasets of the human gut microbiome, we demonstrate that MCSPACE outperforms state-of-the-art methods and uncovers biologically relevant spatiotemporal dynamics in both human and murine gut microbiomes.

Results

MCSPACE is an open-source computational tool for inferring spatiotemporal dynamics of microbiomes from high-throughput microbiome co-localization data

We implemented a custom, generative AI-based model for identifying, from sequencing-based co-localization data such as SAMPL-seq (Fig. 1A–C), assemblages, or groups of microbes that repeatedly co-localize spatially, and detecting changes in the proportions of these assemblages over time and due to introduced perturbations. Note that assemblages model the frequencies of co-localizing microbes, not spatial orientations (e.g., taxa growing in concentric layers around one another), which could not be reconstructed from high-throughput co-localization data. MCSPACE employs a Bayesian probabilistic technique that models the observed count-based sequencing data as being generated by latent, or unobserved, random variables (Fig. 1D–F). These variables are learned from data using an approximate inference method based on a generative AI technique, Variational Autoencoders [36, 37], which is scalable and flexible, enabling modeling tailored to the specific properties of the data. Further, this method provides estimates of confidence, allowing the user to robustly interpret inferences derived from noisy data.

The MCSPACE model was purpose-built to capture key features of high-throughput microbiome co-localization data (Fig. 1E, F). Observed counts of operational taxonomic units (OTUs) in particles are assumed to arise from a set of latent assemblages that consist of frequencies of OTUs. MCSPACE automatically determines the number of assemblages using a Bayesian variable selection (BVS) technique, with a bias toward parsimony (e.g., the lowest number of assemblages warranted by the data). By biasing toward parsimony, MCSPACE guards against overfitting the data, which in the context of our unsupervised generative model would mean “memorizing” individual particles rather than finding recurrent assemblages. Proportions of assemblages are assumed to change over time (Fig. 1E) and may also change with introduced perturbations, but with a bias toward no changes modeled using a BVS technique. Observed reads for each particle are thus assumed to be generated by the following process (Fig. 1F): (1) each particle probabilistically chooses the assemblage it comes from, according to the assemblage proportions for the sample; (2) each observed read chooses to either be generated from a contamination cluster (modeling mixing from unencapsulated DNA), or from the particle’s chosen assemblage; and (3) the read chooses which OTU it comes from, according to the OTU frequencies of either the assemblage or contamination cluster. See the “ Methods” section for full details on the model and inference procedure.

Inputs to the open-source software package are (1) a table of microbial sequencing read counts per particle, in each sample, (2) a table of taxonomic labels for OTUs, and (3) metadata with a time-point index for each sample, and a flag for each time-point indicating whether a perturbation was applied. The software outputs a set of tables and visualizations to facilitate interpretation of its results by biologists, including visualizations of (Fig. 1G–I): (1) members of assemblages, organized into phylogenetic trees; (2) the proportions of assemblages in each sample, and whether each assemblage has been identified as being influenced by each perturbation; and (3) changes in spatial associations between taxa over time.

Human and mouse SAMPL-seq spatial co-localization dataset compendium

For investigating spatiotemporal dynamics of the human gut microbiome, we used two existing longitudinal SAMPL-seq datasets. The first dataset consisted of fecal samples collected from a healthy human donor eating an ad libitum diet over five consecutive days [31] (Fig. 2A, B), and the second consisted of samples from the same donor eating an ad libitum diet, collected over a 12-day period, including inulin supplementation given on days 5–8 [31] (Fig. 2E, F). Raw read counts obtained from the studies were filtered for quality (see the “Methods” section, Figs. S1 and S3). After quality filtering, the first dataset had 58 OTUs and 7055 particles (~ 40 µm in diameter), with a median of 1419 (interquartile range [IQR] = 451) particles per sample (Figs. 2C and S2), and the second dataset had 50 OTUs and 13,250 particles (~ 40 µm in diameter), with a median of 1673 (IQR = 770) particles per sample (Figs. 2G and S3). Figures 2D and 2G visualize the number of reads per particle (median = 843, IQR = 1828 for the first dataset and median = 708, IQR = 1364 for the second). Of note, as seen in the original study [31], the baseline microbiome measured by SAMPL-seq differed substantially between the two datasets (e.g., several Bacteroides taxa detected in the first dataset were absent in the second).Fig. 2. Compendium of SAMPL-seq longitudinal datasets including a new murine dataset with multiple dietary perturbations. A-H Reanalysis of data from a study of gut microbiome spatial co-localizations in a healthy human participant. A–D Daily collections from the participant,* n* = 5 fecal samples total. E–H Collections from the participant over 12 days, with inulin supplementation given on days 5–8 (n = 7 fecal samples total). I–L Longitudinal murine gut microbiome spatial co-localization study (n = 21 fecal samples total, collected from 3 mice). Dietary perturbations were selected to cause significant and differential changes to the composition of the microbiota. S1–S4 = standard chow feeding periods. B, F,** J** Microbial compositions of datasets at the family level, aggregated over particles. For the mouse dataset, the mean proportions are shown over the animals. C, G, and K particle counts and D, H, and L reads per particle distributions after quality filtering. Boxplot centers in panels D, H, and L show medians, with boxes representing interquartile ranges (IQRs). Whiskers extend to data points within 1.5 × IQR

To more comprehensively investigate microbiome spatiotemporal dynamics, we performed a study in mice (Figs. 2I and S4). A cohort of three C57BL/6 Jackson Laboratory (Jax) mice was subjected to defined dietary changes over a 76-day period. After mice were equilibrated on standard chow for 10 days, each dietary perturbation was introduced in sequence, with intervening periods of 2 weeks on standard chow for washout [7, 18] of the perturbation diet. We selected the perturbation diets–high-fat (HF), high-fat high-fiber (HFHF), and low-protein (LP)–for their known strong and distinct effects on microbial composition, with each differentially affecting microbes that preferentially utilize certain carbohydrate, fat, protein, and fiber sources [7, 38, 39]. Details on the composition of the diets are provided in Table S1. Fecal samples were collected from each mouse on each diet after equilibration, at days 10, 18, 35, 43, 57, 65, and 76 (Fig. 2I), yielding 21 samples for SAMPL-seq analyses (Fig. 2J). After quality filtering, the dataset (Figs. 2K and S4) consisted of 74 OTUs and a median of 2829 particles (~ 40 µm in diameter) per sample (IQR = 1319). Figure 2L visualizes the number of reads per particle (median = 745; IQR = 1390). For both the mouse and human datasets, although the relatively short 16S rRNA amplicon sequences produced by SAMPL-seq limited taxonomic resolution, in most cases OTUs were resolvable to at least genus level, with many resolvable to species level, enabling subsequent interpretations of results in terms of canonical traits at the genus or species level (e.g., likely butyrate production).

MCSPACE accurately recovers spatial associations from semi-synthetic data

MCSPACE’s utility hinges on its ability to accurately recover spatial associations from high-throughput data. There are multiple reasons why a machine learning model can fail to accurately recover an underlying latent structure (spatial associations, in our case) including model misspecification, inefficient or incorrect inference algorithms, and overfitting. Of note, although the phenomenon of overfitting is most commonly described in the context of supervised machine learning methods (i.e., those that predict labels given input data), the general concept is also applicable to unsupervised machine learning models, such as MCSPACE. In this context, overfitting intuitively means that the model captures individualized, noisy patterns in data rather than patterns reflecting generalizable properties of the population of interest. Although there are no canonical metrics for overfitting in unsupervised models, a relevant metric is a model’s ability to accurately recover underlying latent structures. To assess this type of metric, ground-truth information is required. Because no comprehensive source of ground-truth microbiome associations exists, we employed a bootstrapping-type procedure that generates semi-synthetic data from the MCSPACE model while preserving distributional properties of real SAMPL-seq data. Briefly, data were simulated using the 34 latent assemblages inferred at the first time-point of the human dataset. The human dataset was chosen as it contained more OTUs per particle (Fig. S2C) and thus presents a more challenging task for recovering spatial associations. We simulated datasets that varied the number of particles or reads per particle to assess how these features of SAMPL-seq data could affect MCSPACE’s performance. Values for the numbers of particles and reads per particle were chosen to mimic the variability of dataset sizes seen in SAMPL-seq runs, with values of 250, 500, 1000, 2500, and 5000 used, spanning approximately an order of magnitude below and above the values observed in real data. Each feature was varied individually, whereas the other was kept constant (set to the value inferred on real data). See the “ Methods” section for details on our data simulation procedure.

We additionally used the semi-synthetic data to evaluate computational methods that have previously been used for analyzing high-throughput microbial spatial co-localization data. The semi-synthetic data, which were simulated from the MCSPACE model itself, obviously cannot be used to draw conclusions about the absolute performances of other methods on real data. However, these analyses can provide insights into how features of the data that vary with experiments, such as the depth of sequencing, could affect results for different computational methods in a relative sense. Two of the comparator methods [30, 31] detect pair-wise associations by binarizing data, then using either Fisher’s exact test or generating a null distribution based on the SIM9 algorithm from the ecology literature [40], to determine statistical significance. A third method [35], Gaussian Mixture Models (GMMs), a standard clustering approach, identifies clusters of particles and is thus capable of finding multi-way interactions. Note that a purpose-built directional GMM model including directional gradients over clusters [35] has also previously been applied to high-throughput microbiome spatial co-localization data. This model assumes directional gradients either along the longitudinal axis (directional GMM 1D) or along both the longitudinal and radial axes (directional GMM 2D) of the gut. We evaluated this method on real data, but were unable to do so on the semi-synthetic data because the method failed to converge on more than 10 clusters, making a fair comparison to the other methods on semi-synthetic data infeasible.

Detecting pairs of spatially co-associated microbes

Pair-wise associations are a frequent unit of analysis in microbial ecosystems [18, 41–43], so we wanted to assess whether the existing algorithms [30, 31] for recovering these types of associations from high-throughput microbiome spatial co-localization data could do so in the setting of underlying multi-way spatial associations, which occur in real microbial communities [11]. We evaluated performance using the area under the receiver operator curve (AUC) for the ability to detect pairs of associations. Although both the Fisher’s exact and SIM9 methods did fairly well, achieving AUCs of around 0.80 and 0.75, respectively (Fig. 3A, B), MCSPACE and the GMM significantly outperformed the other methods. MCSPACE and the GMM performed on par, except in the cases of the lowest numbers of reads or particles, in which MCSPACE significantly outperformed the GMM. These results demonstrate that multi-way-aware methods are better able to detect pairwise associations in data with underlying multi-way associations and suggest that although prior pair-wise detection methods perform reasonably well on this task, there is room for improvement.Fig. 3MCSPACE accurately recovered spatial associations in semi-synthetic data and significantly outperformed comparator methods in predicting held-out real data. MCSPACE and five existing methods were evaluated: Fisher’s exact and SIM9, which detect pairwise interactions, and three types of Gaussian mixture models (GMMs), which infer multiway assemblages. A–F Semi-synthetic data were simulated from the MCSPACE model inferred from real data. Methods were assessed for their ability to recover ground-truth information in three tasks: A, B Detecting co-associated pairs of microbes, assessed with area under the receiver operator curve (AUC) (higher values indicate superior performance). C, D Recovering the correct frequencies of OTUs in assemblages, assessed using assemblage recovery error (lower values indicate superior performance), and E, F inferring the correct number of assemblages, assessed by subtracting the inferred value from the correct value (values nearer to zero indicate superior performance) Methods were benchmarked on real human (G) and mouse (H) data for their ability to predict held-out sequencing reads using fivefold cross-validated cosine distance as the evaluation metric (lower values indicate superior performance). Boxplots show results from 10 simulated replicates for synthetic data or all particles post-filtering in real data (n = 7055 for human and n = 56,848 for mouse). Central lines indicate medians with boxes representing interquartile ranges (IQRs). Whiskers extend to data points within 1.5 × IQR. Statistical significance was assessed with a Wilcoxon rank sum test followed by Benjamini–Hochberg correction for multiple hypothesis testing (*p < 0.05; **p < 0.01, ***p < 0.001, and ****p < 0.0001)

Recovering spatial assemblages

We next assessed whether models (GMMs and MCSPACE) capable of recovering multi-way spatial associations, or assemblages, could do so accurately. To assess this capability, we developed an assemblage recovery error metric, which matches inferred and true assemblages and calculates distances between matched assemblages (Fig. S5); see the “ Methods” section for details. GMMs performed poorly overall on this task, showing no clear improvement with more particles and only slight improvement with increasing read depth (Fig. 3C, D). In contrast, MCSPACE performed significantly better (Wilcoxon rank sum test,* p* < 0.001), up to ninefold, and showed steady gains with increasing numbers of particles and read depths. To gain further insight into these differences in performance, we evaluated how well GMMs and MCSPACE could recover the underlying number of assemblages (Fig. 3E, F). GMMs generally substantially overestimated the number of assemblages, although the method underestimated the number of assemblages in the setting of low numbers of particles; there was no clear trend of improving performance with additional data. In contrast, MCSPACE’s performance was significantly better (Wilcoxon rank sum test,* p* < 0.001), with performance steadily improving with more data, especially with increasing numbers of reads per particle up to a point of saturation at 2500 reads per particle. Of note, this particular point of saturation reflects the simulated complexity of the semi-synthetic data and would be expected to vary in real data depending on the number of actual OTUs and spatial interactions present in the ecosystem under study. These results demonstrate that GMMs have difficulty recovering microbiome spatial assemblages from data with realistic parameters, and there thus may be advantages to a purpose-built model such as MCSPACE.

MCSPACE outperformed comparator methods on real data from both humans and mice

To assess MCSPACE’s and the comparator techniques’ performance on real data, we evaluated the ability of the methods to predict held-out data. This is a common benchmark for generative models in the machine learning literature (e.g., document completion tasks in topic models [44]), which assesses models’ abilities to capture distributional properties of real data accurately and can also be viewed as a metric assessing overfitting for unsupervised machine learning models. We assessed performance via a five-fold cross-validation procedure, with each fold trained on 80% of particles containing all their sequencing reads and 20% of particles containing 50% of their reads (with 50% held out), with the task of predicting the held-out reads. This procedure, of not holding out the entirety of individual units of analysis (particles, in our case), is standard in the machine learning literature and is necessary to provide appropriate context for the model to reconstruct the held-out data. We used the cosine distance between predicted and held-out data as the performance metric (lower values indicate superior performance). See the “ Methods” section for details on the evaluation procedure.

For this task, we compared MCSPACE against standard and directional GMMs [35], which are also generative models. For each cross-validated fold, MCSPACE was run with a capacity of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=100$$\end{document}$ assemblages. The standard GMM was run for a range of 2 to 100 assemblages, and the best model was selected post hoc as described in the “ Methods” section. Directional GMMs failed to converge when the number of assemblages exceeded 10 and were therefore only run up to 10 assemblages. Both standard and directional GMMs inherently assume data is normally distributed and thus model transformed data, rather than the actual data, which consists of sequencing counts. Also, these methods use a post hoc procedure to control model sparsity, unlike MCSPACE, which controls sparsity directly as part of its Bayesian model.

MCSPACE significantly outperformed all comparator models on this task on our compendium containing both human and mouse datasets (Wilcoxon rank sum test, p < 0.0001, Fig. 3G, H). On human data, MCSPACE had a median cosine distance of 0.16, whereas the GMM models ranged from 0.71 to 0.85; on mouse data, MCSPACE’s median cosine distance was 0.005 compared with GMM models ranging from 0.09 to 0.69. These results demonstrate that MCSPACE, which was purpose-built to capture key aspects of real data such as modeling sequencing counts and particle contamination, outperforms models without these capabilities in accurately reproducing real high-throughput microbiome co-localization data.

Spatiotemporal dynamics of human and murine gut microbiomes

MCSPACE elucidated both dynamic and persistent associations in the human gut microbiome with ad libitum diet

To gain insights into the spatiotemporal dynamics of the human gut microbiome, we first applied MCSPACE to the dataset from our compendium profiling fecal samples from a single healthy individual over five consecutive days. MCSPACE identified 58 OTUs assorting into 40 spatial assemblages, with assemblages dominated by abundant bacteria such as Phocaeicola dorei OTU3, present in 22 of 40 (55%) assemblages at ≥ 5% abundance, and Agathobacter rectalis OTU1, present in 14 of 40 (35%) assemblages at ≥ 5% abundance (Fig. S7). These findings are consistent with previous research [31] that identified these species as central “spatial hubs” in the gut.

To assess the dynamic versus persistent nature of assemblages, we examined their temporal proportions and identified the assemblages exhibiting the most and least variability. We defined an assemblage as variable if it was at least threefold more abundant at one time point compared with all other time points, or persistent if the standard deviation of its log proportions over time was less than 1%. Using this definition, we identified three variables and four persistent assemblages (Fig. 4), which exhibited distinct sets of abundant OTUs. Notably, the variable assemblages contained many species previously associated with resistant starch or plant polysaccharide degradation [45–50, 52], such as assemblages A34 and A31, both peaking on day 1 and containing Agathobacter rectalis OTU1 and Ruminococcus bromii OTU10. The persistent assemblages A25, A13, and A6 were each dominated by a single Lachnospiraceae member: Blautia faecis OTU23, Mediterranibacter OTU17, or Roseburia OTU12. Of note, these species or genera are abundant in the gut and have been shown to produce butyrate from a variety of substrates that are common in the Western diet, as well as from host-derived sources including mucin [51–56].Fig. 4MCSPACE analysis of longitudinal high-throughput co-localization data of the human gut microbiome revealed both dynamic and persistent spatial associations. Assemblages with ≥ threefold abundance increase at one time point compared with all other time points were defined as variable; assemblages with the variance of their log proportions over time < 1% were defined as persistent. OTUs that are present in either variable or persistent assemblages with > 5% frequencies are shown. Of note, variable assemblages primarily contained genera or species known to be capable of degrading resistant starches (e.g., Agathobacter rectalis and Ruminococcus bromii) and/or plant fiber polysaccharides (e.g., Roseburia inulinivorans and Bacteroides uniformis). In contrast, persistent assemblages contained many genera or species that have been shown to use nutrients either from cross-feeding interactions (e.g., Phascolarctobacterium faecium is a succinate-consumer and Phocaeicola dorei is a producer) or from ubiquitous diet- or host-derived sources (e.g., Blautia, Mediterranibacter and Roseburia species produce butyrate from a variety of substrates that are common in the Western diet, as well as from host-derived mucin)

MCSPACE revealed shifts in spatial associations among taxa in the human gut microbiome subjected to inulin perturbation

We next applied MCSPACE to the second human dataset with an inulin perturbation, which identified 50 OTUs partitioning into 42 spatial assemblages (Fig. S8). Assemblage A28 had the highest evidence for an increase (Bayes factor = 3.1) with the inulin perturbation and was dominated by Bifidobacterium OTU6 (41%) and Agathobacter rectalis OTU1 (54%), taxa that have previously been shown to increase with inulin supplementation in humans [31, 54–56]. Interestingly, assemblages that increased in their proportions with inulin overall harbored significantly more OTUs than those assemblages that decreased in their proportions (Wilcoxon rank sum test, p < 0.002).

To gain further understanding of the effects of the inulin perturbation on microbial spatial associations, we performed an analysis of changes in spatial associations anchored on key taxa. To conduct this analysis, we developed a pairwise metric that scores the degree of spatial co-localization between an anchoring taxon and those it associates with (see the “ Methods” section) and maps that visualize gains and losses of associations across conditions (Fig. S9). We chose Bifidobacterium OTU6 and A. rectalis OTU1 as the anchoring taxa because these taxa dominated the assemblage with the most evidence of change with the inulin perturbation and have also been previously identified as inulin-consumers [31, 56]. In our spatial change analysis, each taxon was seen to increase both the numbers and the magnitudes of its spatial co-localization scores with the inulin perturbation. The strongest increases in association score magnitudes were seen between Bifidobacterium OTU6 and A. rectalis OTU1, with many other associations also gained or strengthened with the inulin perturbation, including with Ruminococcus bromii OTU10, Monoglobus OTU59, Ruminococcus faecis/torques OTU17, Blautia OTU8, Lachnospiraceae OTU82, Blautia masiliensis OTU13, and Fusicatenibacter saccharivorans OTU9. Interestingly, although some changes in associations were transient and restricted to the supplementation period, others persisted into the post-inulin phase. These findings demonstrate MCSPACE’s ability to uncover both specific and broad alterations in spatial associations in the human gut microbiome in response to dietary perturbation.

MCSPACE uncovered diet-induced spatial redistributions in the murine gut microbiome

We next applied MCSPACE to the mouse SAMPL-seq dataset that included controlled diets, multiple perturbations, and biological replicates, which we thus expected could yield further information on the impact of dietary nutrients on the spatiotemporal dynamics of the gut microbiome beyond the human dataset generated from a single individual on an ad libitum diet. On the mouse dataset, MCSPACE identified 74 OTUs assorted into 63 spatial assemblages (Fig. S10). Of these assemblages, MCSPACE found strong evidence [57] (Bayes factor > 10) that the abundances of 14 were perturbed by at least one of the three dietary interventions compared with their abundances on the preceding standard chow diet period (Fig. S11): five assemblages were perturbed by the HF diet, three by the HFHF diet, and eight by the LP diet, with two assemblages perturbed by multiple diets (A18 by both the HF and LP diets; A3 by both the HFHF and LP diets). These findings echo those on the human datasets and provide further evidence that dietary changes can have significant impact on the spatial organization of the gut microbiome, with both specific and broad effects evident.

To assess how gut microbial spatial associations changed with multiple dietary perturbations, we performed an anchoring taxon analysis similar to that described above for the human dataset. For the mouse dataset, in contrast to the human dataset, multiple dietary perturbations were performed, and prior literature did not clearly suggest key taxa of relevance across the perturbations. We thus took an exploratory approach to select anchoring taxa based on abundance and consistency thresholds. Specifically, anchoring taxa were selected as those with a minimum relative abundance of 5% on at least three of the four diets. This criterion, which yielded two taxa, Akkermansia OTU2 and Lactobacillus OTU1, was used to avoid anchoring on taxa that dropped to low abundance in some conditions and would therefore limit power to detect robust spatial co-localizations and increase the risk of finding spurious associations. The first anchoring taxon, Akkermansia OTU2, showed extensive reorganization of its spatial associations (Fig. 5) with diet. Although this taxon had only modestly higher relative abundance on the HF diet (5.8%, compared with 2.2% on standard chow), it exhibited many more spatial associations than on standard chow, including with Bacteroidales OTU25, Muribaculum gordoncarteri OTU20, Clostridium XVIII OTU34, Lacrimispora OTU26, and Romboutsia OTU29. Interestingly, on the HFHF diet, which uses inulin and pectin as fiber sources versus the other diets that use cellulose, Akkermansia OTU2’s relative abundance was dramatically higher (40%), but other taxa lost all spatial associations with it. In contrast, although Akkermansia OTU2’s relative abundance was also high (23%) on the LP diet, which has reduced protein content and increased sucrose content relative to the other diets, multiple taxa retained their associations with Akkermansia OTU2. The second anchoring taxon, Lactobacillus OTU1, also showed extensive reorganization of its spatial associations (Fig. 5) with diet. This taxon had associations with three taxa on standard chow and four on the HF diet. Similar to Akkermansia OTU2, Lactobacillus OTU1 lost its associations on the HFHF diet. Interestingly, Lactobacillus OTU1’s relative abundance was substantially lower on the LP diet (1.3% versus 41% on standard chow, 11% on the HF diet, and 20% on the HFHF diet), whereas it exhibited the most spatial associations on this diet, including with members of the Lachnospiraceae family (Lachnospiraceae OTU43, Roseburia OTU17, Cellulosilyticum OTU38, Eisenbergiella OTU18) as well as with Romboutsia OTU29 and Ruminococcaceae OTU9. These findings demonstrate MCSPACE’s ability to uncover complex spatiotemporal dynamics in the microbiome that occur with multiple dietary changes.Fig. 5MCSPACE analysis revealed nutrient-induced changes in spatial associations in the murine gut microbiome in our study with controlled dietary perturbations. Taxa to anchor differential spatial association mapping analyses were selected as those with a ≥ 5% relative abundance on ≥ 3 dietary intervals, yielding two taxa: Akkermansia OTU2 and Lactobacillus OTU1. Maps depict associations on S1 (standard chow), HF (high fat), HFHF (high fat, high fiber), and LP (low protein) diets. Node sizes indicate relative abundances of taxa, and edge widths indicate the strengths of spatial associations (< 0.01 not shown). Akkermansia OTU2’s spatial associations shifted significantly with the fiber type in the diet, persisting on insoluble cellulose (S1, HF, LP) and disappearing on the soluble fibers inulin and pectin (HFHF). Lactobacillus OTU1 similarly lost associations with soluble fibers in the diet (HFHF). Additionally, when Lactobacillus OTU1’s preferred fermentation substrate casein was significantly reduced (LP), the taxon decreased in abundance while increasing its spatial associations, suggesting that Lactobacillus OTU1 may engage in cross-feeding with taxa such as Cellulosilyticum OTU38 (a degrader of dietary cellulose) when preferred nutrients are not readily available

Discussion

We have introduced MCSPACE, an open-source software package that implements a custom generative AI-based method for extracting interpretable insights from high-throughput spatiotemporal microbiome data. To evaluate our method, we compiled a compendium of human and mouse spatiotemporal data, including a new murine dataset that we generated, which is the most comprehensive microbiome co-localization dataset to date. Using semi-synthetic data, we demonstrated that MCSPACE could accurately recover spatial associations among taxa, and on the compendium of real datasets, we showed that MCSPACE significantly outperformed state-of-the-art methods in predicting held-out data. Moreover, we showed that MCSPACE discovered interpretable spatial microbial assemblages, including temporally transient and persistent assemblages in the human gut microbiome and “hub” taxa that exhibit shifting spatial associations in the human and mouse gut microbiome in response to dietary components.

Our computational model was purpose-built for its tasks and thus offers distinct advantages over generic methods. Specifically, MCSPACE is a generative Bayesian method, employing a principled probabilistic framework that directly models irregular, sparse, and noisy counts-based data. In contrast, state-of-the-art comparator methods do not directly model the underlying data but instead transform it using techniques such as binarization and log-ratio functions, which introduce distortions of the data and lose important information about its variability [58]. Our Bayesian method also has the advantage of producing quantitative measures of the uncertainty of its estimates, including the number of assemblages and the influences of perturbations, which allows the user to make informed decisions about which relationships warrant further investigation versus those that are more likely to be spurious. Indeed, the advantages of our approach were demonstrated by our superior benchmarking results. A challenge with Bayesian methods, such as ours, is that despite their strong capabilities, they are computationally costly to infer. We addresseded this challenge by developing an efficient inference algorithm that approximates the MCSPACE model using generative AI techniques, while maintaining the highly interpretable structure of our model.

Application of MCSPACE to the human datasets suggests that both temporally dynamic and persistent microbial assemblages occur in the gut, and host diet impacts their transience or durability. For example, in the first dataset, which tracked the spatiotemporal dynamics of a microbiome without specific perturbations, assemblages dominated by A. rectalis and R. bromii transiently spiked on day 1. Given that R. bromii is a resistant-starch degrader previously shown in co-culture to enhance the growth of *A. rectalis *[59]. One possible explanation for this finding is that the assemblages occur due to colonization of starch granules by R. bromii (with spikes driven by variable starch consumption in the ad libitum diet) and A. rectalis co-localizing to capitalize on the syntrophic interaction. Durable assemblages, in contrast, were dominated by species that are generally abundant in the human gut microbiome and are known to metabolize a variety of substrates that are either ubiquitous in the human diet or are host-derived [52–56]. Interestingly, we also found a possible syntrophic interaction in the most durable assemblage, A26, which was dominated by Phascolarctobacterium faecium and Phocaeicola dorei. P. faecium is a known succinate consumer [45, 60], whereas P. dorei produces succinate during central metabolism [46], suggesting a cross-feeding interaction between them. In the second human dataset involving inulin supplementation, similarly to an earlier study [31], we found an increase in microbial spatial co-localization with the inulin perturbation. However, in contrast to the prior study that used conventional statistical techniques and did not find evidence of changes in spatial co-localizations with Bifidobacterium species (primary degraders that hydrolyze inulin into mono- and oligosaccharides [61, 62]), our analyses using MCSPACE found many biologically relevant changes. In particular, our anchoring taxon analyses showed increases in both the numbers and the magnitudes of a Bifidobacterium OTU’s spatial co-localization scores in response to the inulin perturbation. These co-localization changes were with both known primary and secondary inulin fermenters, including A. rectalis, R. bromii, Monoglobus sp., F. saccharivorans, Blautia sp., Lachnospiraceae sp., and R. faecis/torques. Interestingly, some changes were seen to persist beyond the inulin supplementation period, which may reflect the short interval in the study between inulin withdrawal and post-intervention sampling (two days), during which stool may still capture dietary exposures due to gut transit times. These results align with prior reports of inulin’s bifidogenic and butyrogenic effects in the human gut and provide new insights on spatial co-localizations underlying these responses [52–56].

Results from our murine dataset, which used controlled diets with multiple nutrient perturbations, provided further insights into how microbial communities spatially restructure in a context-dependent manner. For instance, MCSPACE analyses revealed that Akkermansia and Lactobacillus, taxa with marked nutrient preferences, dramatically changed their spatial associations with shifts in carbon and nitrogen sources. Soluble fibers, such as inulin and pectin, have been shown to be readily fermented to produce short-chain fatty acids by a broad range of microbes [63–65], including Akkermansia. In contrast, insoluble fibers such as cellulose are not readily metabolized by most microbes, but can stimulate production of host glycans such as mucin [66, 67], which indirectly benefits Akkermansia, a mucin degrader [68]. Our findings thus suggest the hypothesis that in the appropriate context, insoluble fibers available from the diet may promote cross-feeding and gains in microbial spatial co-localizations organized around Akkermansia as a hub organism that can degrade host glycans and produce a variety of metabolic byproducts useful to other bacteria [69–71], whereas insoluble fibers in the diet can promote independent growth [63–65] of Akkermansia. In the case of Lactobacillus, species in this genus preferentially ferment casein [72], and although all the diets in our experiments used casein as a protein source, the LP diet contained approximately one-third the casein of the other diets. When casein is not available, Lactobacillus species can also efficiently utilize other energy sources, including soluble fibers such as inulin and pectin [65, 73]. Our findings thus suggest the hypothesis that Lactobacillus species grow relatively independently in the gut when casein and soluble fibers are readily available in the diet (i.e., on the HFHF diet), whereas it co-localize to engage in cross-feeding interactions when preferred energy sources are less available, such as interacting with Cellulosilyticum sp., a degrader of dietary cellulose [74], on the LP diet. Interestingly, we saw persistence of the decrease in Lactobacillaceae during the LP diet into the next dietary interval on standard chow (S4) (mean of 31% on S1–S3 versus 4% on S4; Wilcoxon rank sum test, p < 0.05), with a concurrent increase in Lachnospiraceae (mean of 23% on S1–S3 versus 52% on S4; Wilcoxon rank sum test, p < 0.05), suggesting that the LP diet may cause longer term changes to the gut microbiome. Overall, our results demonstrated MCSPACE’s power to characterize complex and dynamic spatial co-localization within the gut microbiome subjected to multiple perturbations and to suggest specific hypotheses about the influence of host diet on ecological relationships among microbes. Of note, unlike in the human dataset, in the mouse dataset Bifidobacterium species were not detected. Further, in our mouse experiments, controlled diets were employed, whereas in the human study inulin supplementation was given with an ad libitum diet. Thus, the differences between findings in the human and mouse experiments related to inulin perturbations highlight the importance of context (underlying taxa present, base diet, etc.) in comparative microbiome studies.

Our study had several limitations, which suggest multiple directions for future work. First, in this study, we evaluated MCSPACE on SAMPL-seq data from fecal samples. Our method is general and could potentially be applied to other spatial co-localization data with similar tabular formats and from other sample types, including tissue, soil, and other environmental specimens. However, such applications would require further testing and validation of MCSPACE. Second, the SAMPL-seq method only generates amplicons covering 69 bases of the 16S rRNA gene, which limits phylogenetic resolution. A consequence of low phylogenetic resolution, in addition to limiting taxonomic resolution, is that gene content, including 16S rRNA gene copy numbers, cannot be accurately imputed. Third, each dataset in our compendium was generated to characterize a single range of particle sizes. MCSPACE could be readily extended to handle multiple particle sizes, and with such data available, we expect our method would increase its overall accuracy in recovering spatial assemblages as well as provide new insights into assemblages at larger scales. Fourth, the datasets we analyzed employed a relatively limited set of perturbations. For our mouse studies, we chose these dietary perturbations because they caused large shifts in the microbiome, and we indeed saw dramatic changes in spatial co-localizations that could be attributed to microbial preferences for carbon and nitrogen sources. In future work, it would be interesting to explore a wider variety of diets and to quantitatively vary dietary components to establish the relationship between specific nutrients and microbial spatial structuring. Additionally, it would be interesting to study the effects of a broader range of perturbations, such as antibiotics, phages, and host pathogens. Fifth and finally, although the data we trained MCSPACE on, SAMPL-seq, has the advantage of being high-throughput, it has the disadvantage of lacking in situ spatial context. In future work, it would be interesting to incorporate in situ data, such as MERFISH [75, 76], GeoMX [77], or Xenium [78], as additional modalities informing assemblage membership. Data capable of spatially resolving metabolites or cell surface molecules, such as MALDI MSI [79], would be particularly interesting to incorporate. These data could provide direct support for hypotheses generated by MCSPACE, including cross-feeding relationships that exist in the complex gut milieu but may be difficult to replicate in vitro.

In conclusion, our work provides new resources for characterizing the spatiotemporal dynamics of the microbiome at the ecosystem-level scale. We introduced the MCSPACE software, which makes sophisticated probabilistic AI capabilities available to researchers studying these complex ecosystems. Further, we have generated a comprehensive spatiotemporal dataset of the mouse gut microbiome that includes dietary perturbations, which we believe will serve as a valuable benchmarking and analytical resource for the community. Overall, our work provides a foundation for analyzing the spatial organization of microbiomes, which holds promise for unraveling the mechanisms of microbe–microbe and host-microbe interactions and developing means to harness the microbiome to promote health and ameliorate disease in the host.

Methods

Key resource table

Reagent or resourceSourceIdentifierBiological samples Fecal samples from a cohort of C57BL/6 Jackson Laboratory miceThis paperN/ADeposited data 16S rDNA amplicon sequencing dataThis paperSRA: PRJNA1182308 Model inference and analysis resultsThis paperhttps://zenodo.org/records/14166041Experimental models: organisms/strains Mouse: C57BL/6The Jackson LaboratoryIMSR_JAX:000664Software and algorithms MCSPACEThis paperhttps://github.com/gerberlab/MCSPACE Directional Gaussian Mixture ModelsPasarkar et al. [35]https://github.com/amepas/Spatial_Mbiome USEARCH v11.0.667Edgar [80]https://drive5.com/usearch/ VSEARCH v2.28.1Rognes et al. [81]https://github.com/torognes/vsearch ULTRAPLEX v1.2.5Wilkins et al. [82]https://github.com/ulelab/ultraplex SeqKit v2.8.2Shen et al. [83]https://github.com/shenwei356/seqkit R v4.3.2R Core Teamhttps://www.r-project.org/ DNABarcodesBuschmann [84]https://bioconductor.org/packages/DNABarcodes/ Barcode correction R scriptThis paperhttps://github.com/gerberlab/MCSPACE/blob/main/mcspace/MCSPACE_paper/scripts/read_processing/Barcode%20Correction RDP classifierWang et al. [85]https://github.com/rdpstaff/classifier gappa v0.8.5Czech et al. [86]https://github.com/lczech/gappa pplacer v1.1.alpha19Matsen et al. [87]https://github.com/matsen/pplacer Python v3.10.14Python Software Foundationhttps://www.python.org/ PyTorch v2.4.0 + cu118Paszke et al. [88]https://pytorch.org/ Numpy v1.26.4Harris et al. [89]https://numpy.org/ Scipy v1.14.0Virtanen et al. [90]https://scipy.org/ Matplotlib v3.9.2Hunter et al. [91]https://matplotlib.org/ Seaborn v0.13.2Waskom et al. [92]https://seaborn.pydata.org/ Source code for analysesThis paperhttps://github.com/gerberlab/MCSPACE/tree/main/mcspace/MCSPACE_paperOther Standard rodent dietEnvigo-TekladTD.08806 High fat dietEnvigo-TekladTD.06414 High fat high fiber dietEnvigo-TekladTD.210230 Low protein dietEnvigo-TekladTD.160634 SAMPL-seq protocolRichardson et al. [31]https://www.nature.com/articles/s41564-024-01914-4 SAMPL-seq data for the human time series datasetRichardson et al. [31]https://www.nature.com/articles/s41564-024-01914-4 SAMPL-seq data for human time series with inulin perturbation datasetRichardson et al. [31]https://www.nature.com/articles/s41564-024-01914-4

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Georg K. Gerber ([email protected]).

Materials availability

This study did not generate new unique reagents.

Data and code availability

Raw sequencing data are available through SRA, and accession numbers are listed in the key resources table.
The MCSPACE tool is publicly available with an open-source license on GitHub: https://github.com/gerberlab/MCSPACE. Code and processed data needed to generate figures from this study, Python notebooks with instructions for data processing, and all data necessary to run MCSPACE on example datasets are included. Additional files containing inference results from running the models are available on the Zenodo repositories. Links to repositories are listed in the key resources table.
All information required to reanalyze the data is available on GitHub and through the Zenodo link.

Experimental model and study participant details

Human study participants

No human subjects research was conducted as part of this study.

Mouse experiments

Experiments were conducted under Columbia University Medical Center Institutional Animal Care and Use Committee protocol AC-AABD4551. A cohort of three 6–8-week-old female C57BL/6 mice purchased from Jackson Laboratory was used in the experiments. The mice were co-housed and equilibrated for a period of 10 days on standard chow (Teklad custom diet, TD.08806) before undergoing a series of three dietary perturbations: high fat (HF), high fat high fiber (HFHF), and low protein (LP) diets, in that order. The HF and HFHF perturbations lasted for 10 days, and the LP perturbation lasted for 9 days, each followed by 2-week periods off perturbations, on standard chow. For the HF perturbation, Teklad custom diet TD.06414 (60% kcal of fat) was used. For the HFHF diet, Teklad custom diet TD.210230 (5% Pectin, 5% Inulin with 61 kcal% of fat) was used. For the LP diet, Teklad custom diet TD.160634 (6 kcal% of protein) was used. Fecal pellets were collected from individual mice on days 10, 18, 35, 43, 57, 65, and 76 in a sterile hood, transferred to 1.5 mL tubes, and placed on dry ice throughout the duration of sampling. Following sample collection, tubes were stored at − 80 °C.

Method details

SAMPL-seq analyses

Fecal pellets were recovered from − 80 °C storage and transferred to 1.5 mL tubes containing methacarn (60% methanol, 30% chloroform, 10% acetic acid). After 24 h of fixation, samples were washed with 70% ethanol and processed following the SAMPL-seq protocol [31]. After fracturing and barcoding, 20–40-micron particles were isolated by size-exclusion filtering for sequencing. For each mouse, two technical replicates of approximately 20,000 particles were used for sequencing. Samples were sequenced from single ends for 152 cycles on an Illumina NextSeq550 using a High-output reagent Kit (Illumina 20,024,907).

Bioinformatics and data pre-processing

Sequence filtering and 16S rDNA amplicon analysis

Raw sequencing reads were first filtered using USEARCH 11.0.667 [80], with a cutoff of less than 1 expected error and a minimum length of 150 bp. Filtered reads were then demultiplexed using ULTRAPLEX [82]. Specifically, ULTRAPLEX was used to identify and extract barcodes from each read, using a custom barcode mapping, and then concatenate extracted barcodes to the fastq header for each read. After barcode extraction, 16S rRNA amplicon primers were removed from reads using SeqKit [83]. The resulting reads correspond to a 69 bp 16S rRNA V4 region. All samples were then pooled together for denoising using UNOISE3 [93], and reads were mapped to zOTUs using VSEARCH [81].

Barcodes were then subjected to error correction using the DNABarcodes package [84] in R using a custom script available on GitHub (https://github.com/gerberlab/MCSPACE). Our barcode set allows for error correction of 1 base error, so barcodes with a Hamming distance larger than 1 were considered uncorrectable and removed. ~ 96% of all barcode sequences were either correct or correctable.

Filtering criteria

SAMPL-seq data were pre-processed before performing inference with MCSPACE to remove potentially artifactual particles and unreliably measured taxa. Three types of filters were applied sequentially:

Removal of particles with large amplification, potentially caused by barcode clashes or other experimental artifacts. A threshold of > 10,000 was used.
Removal of low-abundance particles, which may result from over-amplification, failed amplification, or barcode collisions [30, 31]. To establish filtering thresholds for a minimum number of reads per particle and a minimum OTU abundance across all particles, threshold values were varied and the resulting numbers of particles and OTUs retained were plotted for each threshold (Fig. S1). Final threshold values were then chosen by picking an “elbow” for each plot. A lower read threshold of 250 was chosen, corresponding to the 95th percentile in human data and the 88th percentile in mouse data, whereas the 10,000 read threshold corresponded to the 99.8th percentile in human data and the 99.4th percentile in mouse data.
Removal of low-abundance OTUs, which cannot be reliably measured across samples. OTU relative abundances were computed by aggregating over their abundances in particles remaining after filters 1–2. OTUs occurring above 0.005 relative abundance at any timepoint in human data and OTUs occurring above 0.005 relative abundance in at least two subjects at any timepoint in mouse data were retained.

The first two pre-processing steps reduced the number of particles by 96% in the human time series (from 212,190 to 7055) without perturbation, 74% in the human time series with inulin perturbation (51,866 to 13,250), and 88% in the mouse (489,858 to 56,848) datasets. This substantial reduction was driven by a highly skewed read depth distribution, in which most particles contained very few reads (most contained a single read).

Taxonomic assignments

For the mouse dataset, taxonomy was assigned to OTUs using the RDP classifier [85]. To obtain better taxonomic resolution, a phylogeny-aware taxonomic identification was used for the human dataset, as in Kozlov et al. [94], using the software gappa [86]. Gappa requires phylogenetic information that is not as reliable for the mouse gut microbiome and was therefore not applied to the mouse dataset.

Phylogenetic placement

Phylogenetic placement was used to visualize assemblages on a phylogenetic tree. A reference phylogenetic tree constructed from full-length 16S rRNA sequences was taken from Maringanti et al. [95] (https://github.com/gerberlab/mditre/tree/master/mditre/tutorials/pplacer_files). Placement of OTUs on the tree was then performed by running the software pplacer [87] with default settings.

MCSPACE software

MCSPACE was implemented in Python 3.10 using the PyTorch, Numpy, Scipy, Matplotlib, and Seaborn packages [88–92]. The software is publicly available as an open-source package on GitHub (https://github.com/gerberlab/MCSPACE). The input to MCSPACE consists of three CSV files: (1) a file giving a table of counts for OTUs per barcoded particle, for each sample; (2) a file giving a taxonomic label for each taxon; and (3) a file giving information on which timepoints correspond to perturbations. The software outputs inference results in two files: (a) a PyTorch file that contains the inferred model, and (b) a Python pickle file that contains summary quantities of the model posterior. The software also provides tools to visualize and interpret model results including generating pdf files of graphs for learned spatial assemblages, assemblage proportions, and Bayes factors for significantly perturbed assemblages. See online software documentation (https://github.com/gerberlab/MCSPACE/tree/main/mcspace/docs) and tutorials (https://github.com/gerberlab/MCSPACE/tree/main/mcspace/tutorials) for complete information.

MCSPACE model

MCSPACE uses a sparse Bayesian mixture model to discover, from co-localization count data, latent groups of spatially co-localized microbes (spatial assemblages) as well as changes in assemblage proportions across time and effects due to perturbations. Assume we have $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O$$\end{document}$ OTUs, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$S$$\end{document}$ subjects (biological replicates), $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T$$\end{document}$ time points, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${L}_{s{t}_{i}}$$\end{document}$ particles at time-point $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${t}_{i}$$\end{document}$ in subject $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s$$\end{document}$ . The generative process for the mixture model is then as follows:

For each subject $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s$$\end{document}$ and assemblage k, sample an initial untransformed mixture weight $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${x}_{ks}\left(0\right)\sim \mathrm{Normal}(0,1)$$\end{document}$
For each subject $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s,$$\end{document}$ sample the process variance $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma_{w,s}^{2}=\mathrm{SoftPlus}(\nu_{w,s}^{2})$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu_{w,s}^{2}\sim \mathrm{Normal}(\mu_{w},\rho_{w}^{2})$$\end{document}$
For each perturbation $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p$$\end{document}$ and assemblage k:

Sample perturbation indicators $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c_{kp}\sim \text{Bernoulli }(\pi_{c})$$\end{document}$ for each assemblage $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document}$
Sample perturbation magnitudes $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta_{kp}\sim \mathrm{Normal}(0,\rho_{\delta }^{2})$$\end{document}$ for each assemblage $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document}$

For each subsequent time point $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${t}_{i}$$\end{document}$ , assemblage k, and subject s:

Compute perturbed untransformed weight means $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta_{ks}(t_{i})$$\end{document}$ (see below)
Sample untransformed mixture weights $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{ks}(t_{i})\sim \mathrm{Normal}(\eta_{ks}({t}_{i}),\sigma_{w,s}^{2}\Delta {t}_{i})$$\end{document}$

For each assemblage k, sample sparsity indicators $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\gamma_{k}\sim \mathrm{Bernoulli}(\pi_{\gamma})$$\end{document}$ and compute assemblage mixture weights $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta_{ks}({t}_{i})=\frac{\gamma _{k}\mathrm{exp}({x}_{ks}({t}_{i}))}{\sum_{j}\gamma_{j}\mathrm{exp}(x_{js}({t}_{i}))}$$\end{document}$
For each time point $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t$$\end{document}$ , sample contamination mixture weights $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pi_{G,t}=\mathrm{Sigmoid}(\widehat{\pi}_{G,t})$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{\pi }_{G,t}\sim \mathrm{Normal}(\mu_{\pi},\rho_{\pi }^{2})$$\end{document}$
For each observed particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l$$\end{document}$ in subject $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s$$\end{document}$ at time $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t$$\end{document}$ :

Choose an assemblage assignment according to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$z_{ls}(t_{i})\sim \mathrm{Categorical}({\boldsymbol{\beta}}_{{\boldsymbol{s}}}(t_{i}))$$\end{document}$
For each read $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j$$\end{document}$ , in particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l$$\end{document}$ :
i.Choose whether the read arises from the assemblage or the contamination cluster according to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$a_{lj}\sim \mathrm{Bernoulli}(\pi_{G,t})$$\end{document}$
ii.Sample read $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${y}_{loj} \; | \; {a}_{lj} \sim \left\{\begin{array}{c} \mathrm{Categorical}(\theta_{z_{lo}}) \; \mathrm{if} \; a_{lj}=0\\ \mathrm{Categorical}(G_{to}) \; \mathrm{if} \; a_{lj}=1\end{array}\right.$$\end{document}$

Perturbed untransformed assemblage weights $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta_{k,s}({t}_{i})$$\end{document}$ are computed as follows:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned}\eta_{k,s}(t_{i})=x_{k,s}(t_{i-1})+c_{kp}\delta_{kp}, \quad \text{if perturbation}\, p \ \mathrm{turns}\ \mathbf{on}\ \mathrm{at\ time}\ t_{i}\\\eta_{k,s}(t_{i})={x}_{k,s}(t_{i-1})-c_{kp}\delta_{kp}, \quad \text{if perturbation}\, p \ \mathrm{turns}\ \mathbf{off}\ \mathrm{at\ time}\ t_{i}\end{aligned}$$\end{document}

The terms $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$c_{kp}\delta_{kp}$$\end{document}$ correspond to perturbation effects. In the absence of an experimental perturbation, evolution of untransformed mixture weights is thus modelled as occurring via temporal drift.

The assemblage parameters $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta_{ko}$$\end{document}$ , which correspond to the probability of generating reads from OTU o in assemblage k, are modeled non-stochastically and are optimized during the inference procedure. The corresponding parameters from the contamination clusters $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$G_{to}$$\end{document}$ are computed from data as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${G}_{to}=\frac{{\sum }_{s}{\sum }_{l}{r}_{lsot}}{{\sum}_{s}{\sum}_{l}{\sum}_{j}{r}_{lsjt}}$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${r}_{lsot}$$\end{document}$ corresponds to the read count for OTU $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$o$$\end{document}$ in particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l$$\end{document}$ in subject $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s$$\end{document}$ at time $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t$$\end{document}$ , modeling the assumption that contamination arises from unencapsulated DNA that occurs with frequencies proportional to the total DNA in the sample for each OTU.

Hyperparameter settings

Model hyperparameters were chosen to ensure either diffuse priors (large variances, modeling limited/no prior information) or model parsimony (to encourage sparse results), depending on the role of the parameter. These settings are designed to be robust so that, in general use cases, no manual tuning is required by the user. A table of each hyperparameter, its definition, and setting is given Table 1.

Table 1. Hyperparameters, default settings, and rationale for settingsSymbolDefinitionSettingRationale $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K$$\end{document}$ Maximum possible number of assemblages100Chosen to be larger than the expected value. Assemblage indicators select for a subset of the maximum possible. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mu}_{w}$$\end{document}$ Process variance prior location0.01Prior expectation of 1% process variance. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\rho}_{w}^{2}$$\end{document}$ Process variance prior variance10Set to yield a diffuse prior. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\pi}_{c}$$\end{document}$ Perturbation indicator prior probability0.5/KPrior expectation of no perturbation effect on any assemblage, to encourage model parsimony. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mu }_{\delta }$$\end{document}$ Perturbation magnitude prior location0Reflects equal plausibility of positive or negative effects. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\rho }_{\delta }^{2}$$\end{document}$ Perturbation magnitude prior variance100Set to yield a diffuse prior. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mu }_{\pi }$$\end{document}$ Contamination cluster weight prior locationLogit(0.05)Prior expectation of 5% contamination; weakly informative. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\rho }_{\pi }^{2}$$\end{document}$ Contamination cluster weight prior variance10Set to yield a diffuse prior. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\pi }_{\gamma }$$\end{document}$ Assemblage indicator prior probability $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\left(\frac{0.5}{K}\right)}^{\xi }$$\end{document}$ Prior expectation of less than one assemblage being present, to encourage model parsimony. The exponent $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\xi$$\end{document}$ is used to scale the prior strength to the amount of observed data. We set $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\xi =0.005\times \#\mathrm{reads}$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\#\mathrm{reads}={\sum }_{lsot}{r}_{lsot}$$\end{document}$ , is the total number of reads in the dataset, corresponding to a prior strength of 0.5% of the dataset size.

Hyperparameter sensitivity

Hyperparameters controlling the presence of perturbation effects ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\pi }_{c})$$\end{document}$ and assemblage sparsity ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\pi }_{\gamma }$$\end{document}$ ) directly impact model complexity and were set to promote parsimony, as described above. Settings for variances on process variance, perturbation magnitude, and contamination weights were chosen to model diffuse priors. To evaluate the sensitivity of results for these hyperparameter settings for diffuse priors, we used semi-synthetic time-series data (see below for details on the data generation procedure). These data were generated by simulating from MCSPACE-inferred results on the mouse dataset, chosen because it contained multiple perturbations, which allowed for calibration and testing of the perturbation hyperparameters. We generated 10 semi-synthetic datasets and evaluated AUC for pairwise association detection, assemblage recovery error, and assemblage number deviation. Results varying the hyperparameters across two orders of magnitude (Fig. S6) showed consistency of benchmarking metrics, demonstrating that the model is not sensitive to the hyperparameter settings used.

Inference

Inference was performed using a variational autoencoder-based method [36, 37], in which parameters of the approximating distributions are estimated as functions of the data learned by a deep neural network. Specifically, inference networks were constructed that take a normalized representation $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\widehat{r}}_{lsot}$$\end{document}$ of the data as input, and output parameters for Gaussian approximating distributions, i.e., $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${x}_{kst}\sim \mathrm{Normal}\left({\mathrm{MLP}}_{\mu}({\widehat{r}}_{lsot}), {\mathrm{MLP}}_{\sigma^{2}}(\widehat{r}_{lsot})\right)$$\end{document}$ . Normalized reads for each particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l$$\end{document}$ were first passed into a fully connected two-layer multilayer perceptron (MLP) encoder network with $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O$$\end{document}$ inputs and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H=50$$\end{document}$ outputs, with SoftPlus activations. The outputs of the encoder were then averaged over all particles and then passed through final linear layers that output parameters for the mean, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu$$\end{document}$ , and log standard deviation, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mathrm{log}(\sigma)$$\end{document}$ , of the approximating distribution. Kullback–Leibler (KL) terms were computed analytically for all variables except for $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${x}_{kst}$$\end{document}$ , which we approximated using a Stochastic Gradient Variational Bayes estimator. Variables selecting the assemblage type, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${z}_{lst}$$\end{document}$ were marginalized out for inference. A Gumbel-Softmax approach for the Bernoulli-distributed variables $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{\gamma}}$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{c}}$$\end{document}$ was used, with global parameters for each of their variational approximations [96].

The model was implemented using Pytorch [88] and trained end-to-end via gradient descent employing the ADAM optimizer with default parameters and an initial learning rate set to 0.005. Model parameters other than $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\theta }_{ko}$$\end{document}$ were randomly initialized from standard normal distributions. For initializing the assemblage parameters $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\theta }_{ko}$$\end{document}$ , an initial fit with the K-means clustering algorithm was performed using the function K-means from the sklearn.cluster package. Data input to K-means was transformed with an isometric log-ratio (ILR) transformation. For K-means fitting, the n_clusters parameter was set equal to the MCSPACE model capacity; default values were used for all other parameters. An inverse ILR transformation was then used to map K-means inferred cluster centers back onto the simplex, which were then used to initialize assemblage parameters $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\theta }_{ko}$$\end{document}$ . During inference, the sparsity prior’s exponent parameter, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\xi$$\end{document}$ , was gradually increased, with $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\xi$$\end{document}$ initially set to 1 for the first 10% of training epochs, then linearly increased, reaching its final value at 90% of epochs, then held constant for the last 10% of epochs. MCSPACE was run with 10 resets with different initial seeds, and the model with the lowest average loss was selected.

Model training and inference were performed on a workstation with an AMD Ryzen 7 7745 HX CPU, 16 GB RAM, and an NVIDIA RTX 4070 GPU (8 GB VRAM). Runtimes for each dataset are given in Table 2. Table 2. Run times for MCSPACE on the datasets analyzedDatasetNumber particlesNumber OTUsRuntime (GPU)Runtime (CPU)Human time series without perturbation7055581 h 27 m2 h 5 mHuman time series with inulin13,250502 h 20 m3 h 22 mMouse time series56,848744 h 30 m20 h 38 m

Bayes factors

Bayes factors were computed to assess the evidence of alternative models indicating the presence or absence of a perturbation effect given the data. For a given perturbation effect from perturbation $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p$$\end{document}$ on assemblage $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document}$ , Bayes factors $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathrm{BF}}_{kp}$$\end{document}$ were calculated from the estimated posterior probability of the perturbation being present $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${q}_{{c}_{kp}}$$\end{document}$ and the prior probability $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\pi }_{c}$$\end{document}$ . That is, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathrm{BF}}_{kp}=$$\end{document}$ $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{{q}_{{c}_{kp}}}{1-{q}_{{c}_{kp}}}\cdot \frac{1-{\pi }_{c}}{{\pi }_{c}}$$\end{document}$ .

Posterior summary of assemblage proportions

Posterior assemblage proportions $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{\beta}}$$\end{document}$ were summarized as follows. First, assemblages with posterior probability $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\gamma }_{k}<0.95$$\end{document}$ were removed. Latent mixture weights $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{x}}$$\end{document}$ were then sampled 1000 times from the estimated posterior distribution and passed through a softmax function with discarded assemblages masked out, to get posterior samples of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${{\boldsymbol{\beta}}}^{{\boldsymbol{s}}}$$\end{document}$ for assemblages to be used in downstream analyses. The arithmetic mean over samples $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${{\boldsymbol{\beta}}}^{{\boldsymbol{s}}}$$\end{document}$ was then computed to get the posterior mean of assemblage proportions.

Comparator generative models

The scikit-learn [97] package was used for the basic Gaussian mixture model. Specifically, the function GaussianMixture from the sklearn.mixture package was used, with default parameters. The model was trained on the relevant input data, varying the number of components with 10 runs each time with different initial seeds, and the best model was selected based on the Akaike information criterion (AIC) [98].

For directional Gaussian mixture models, implementations provided in https://github.com/amepas/Spatial_Mbiome were used. Following Pasarkar et al. [35], each model was trained with increasing numbers of assemblages, each with 10 different initial seeds, and the best-fit model was selected based on the AIC. Notably, the directional GMMs failed to converge when the number of assemblages exceeded 10.

Read counts were first converted to relative abundances, and then, an isometric log-ratio (ILR) transformation was applied as in Pasarkar et al. [35] Zeros were handled using multiplicative replacement [99] using $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\delta =1/{O}^{2}$$\end{document}$ for $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O$$\end{document}$ taxa as in Pasarkar et al. [35]

Benchmarking with simulated data

Semi-synthetic data generation

To benchmark MCSPACE against comparator methods and to evaluate model sensitivity to hyperparameter settings, we generated semi-synthetic data based on assemblages inferred on real data. This approach allowed us to test model performance under controlled, yet realistic, conditions.

For evaluating against comparator methods, semi-synthetic data were generated from the human dataset. MCSPACE was run on the initial time point of the human dataset, resulting in $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K=34$$\end{document}$ baseline assemblages. Semi-synthetic data were then generated from the baseline assemblages using the following bootstrapping-style approach:

Sample $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K$$\end{document}$ assemblage frequencies $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{\theta}}$$\end{document}$ and their proportions $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{\beta}}$$\end{document}$ , with replacement from the baseline assemblages. Re-normalize assemblage proportions $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{\beta}}$$\end{document}$ to sum to unity, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{\beta}}\leftarrow{\boldsymbol{\beta}}/{\sum }_{k}{\beta }_{k}$$\end{document}$ .
For each vector of assemblage frequencies $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\theta }_{ko}$$\end{document}$ , randomly permute OTU labels $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$o$$\end{document}$ .

aSample perturbation indicators $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${c}_{kp}\sim \mathrm{Bernoulli}({\widehat{\pi }}_{\mathrm{c}})$$\end{document}$ with $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\widehat{\pi }}_{c}$$\end{document}$ estimated from the posterior.
bSample perturbation magnitudes $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\delta }_{kp}\sim \mathrm{Normal}({\widehat{\mu }}_{\delta },{\widehat{\rho }}_{\delta }^{2})$$\end{document}$ with perturbation magnitude location $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\widehat{\mu }}_{\delta }$$\end{document}$ and scale $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\widehat{\rho }}_{\delta }^{2}$$\end{document}$ estimated from the posterior.

Generate particle data:

For each particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l$$\end{document}$ :
i.Sample the number of reads in particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${R}_{l}\sim \mathrm{NegativeBinomial}(n,p)$$\end{document}$
ii.Sample an assemblage $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${z}_{l}\sim \mathrm{Categorical}({\boldsymbol{\beta}})$$\end{document}$
iii.For each read $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j$$\end{document}$ , in particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l$$\end{document}$ :

Choose whether the read arises from the assemblage or the contamination cluster according to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${a}_{lj}\sim \mathrm{Bernoulli}\left({\pi }_{G}\right)$$\end{document}$
Sample read $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${y}_{loj} \; | \; {a}_{ij}\sim \left\{\begin{array}{c}\mathrm{Categorical}\left({\theta }_{{z}_{l}o}\right) \; \mathrm{if} \; {a}_{lj}=0\\ \mathrm{Categorical}\left({G}_{to}\right) \; \textrm{if} \; {a}_{lj}=1\end{array}\right.$$\end{document}$

For varying the number of particles or reads per particle, each feature was varied individually, whereas the other was kept constant using default values based on what was inferred on real data. The default values used were 1419 particles, negative binomial parameters p = 0.000792, n = 2.582442, and a contamination rate of 0.004.

MCSPACE and the basic GMM models were set to learn a maximum of 100 assemblages; as noted above, the directional GMM model failed to converge with more than 10 assemblages, so this was the maximum it was allowed to learn.

To assess MCSPACE’s sensitivity to hyperparameter settings, including those influencing perturbation parameters, we generated semi-synthetic time-series data based on results from our mouse dataset, which included multiple perturbations over time. This data generation process extended the same bootstrapping procedure described above to multiple time points by evolving assemblage proportion $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol\beta}_t$$\end{document}$ over time using a Gaussian random walk. Specifically,

Sample $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$K$$\end{document}$ latent assemblage proportions, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{ks}\left(0\right)$$\end{document}$ , from baseline inferred values
For each assemblage $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document}$ and perturbation $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p$$\end{document}$ :
For each subject $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s$$\end{document}$ and each subsequent timepoint $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t_i$$\end{document}$ :

aCompute perturbed untransformed weight means $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta_{ks}(t_i)$$\end{document}$
i. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta_{k,s}({t}_{i})= x_{k,s}({t}_{i-1})+{c}_{kp}{\delta }_{kp},\text{ if perturbation }p\text{ turns }\mathbf{on}\text{ at time }{t}_{i}$$\end{document}$
ii. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\eta_{k,s}\left(t_{i}\right)= x_{k,s}({t}_{i-1})-{c}_{kp}{\delta }_{kp},\text{ if perturbation }p\text{ turns }\mathbf{o}\mathbf{f}\mathbf{f}\text{ at time }{t}_{i}$$\end{document}$
bSample untransformed mixture weights $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$x_{ks}(t_i)\sim\mathrm{Normal}\left(\eta_{ks}(t_{i}),\sigma_{w,s}^2\Delta{t}_{i}\right)$$\end{document}$ , with process variance $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma_{w,s}^2$$\end{document}$ estimated from the posterior
cCompute assemblage mixture weights $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta_{ks}\left(t_i\right)=\frac{\exp\left(x_{ks}\left(t_i\right)\right)}{\sum_j\exp\left(x_{js}\left(t_i\right)\right)}$$\end{document}$

Simulation settings for the number of particles, reads per particle, and contamination rate were set to match empirical data. Specifically, 2829 particles per sample and negative binomial parameters p = 0.000726, n = 1.107895, and a contamination rate of 0.018.

Comparison to pairwise analysis methods

To compare MCSPACE with methods that detect pairwise associations, a relative abundance threshold of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tau =0.005$$\end{document}$ was used to determine presence/absence of an OTU in each particle, consistent with what was used in Urtecho et al. [100] Ground-truth associations between pairs of OTUs $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j$$\end{document}$ in assemblage k were calculated from simulation runs as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${p}_{ij}=I\left({\theta }_{ki}>\tau \right)I\left({\theta }_{kj}>\tau \right)$$\end{document}$ , i.e., the association exists if and only if the abundance of both OTUs is greater than the threshold in an assemblage.

For the Fisher’s exact test and SIM9 algorithms, particle reads were first binarized. The probability of co-association with Fisher’s exact test was computed as in Sheth et al. [30] Briefly, 2 by 2 contingency tables of appearance were calculated for all pairs of OTUs and Fisher’s exact test was then used to calculate the probability of pairs occurring together more frequently than expected, assuming equiprobable occupancy at all sites. Resulting p values were then adjusted via the Benjamini–Hochberg procedure [101].

For the SIM9 approach [40], co-association was quantified using the “sim9_single” function in the EcoSimR package [102]. On each set of particles, a binarized OTU table was subjected to a random swap, preserving OTU prevalence and particle diversities. This step was performed 25,000 times to generate a randomized community based on the original diversity of the dataset. Fifty such randomized communities were generated to produce a null distribution of OTU co-localization. The location of the observed co-occurrence frequency was then determined in this null distribution, and a one-tailed p value was computed using a z-test. Resulting p values were then adjusted via the Benjamini–Hochberg procedure [101].

Co-association probabilities were computed for MCSPACE as follows: For a given posterior sample $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s$$\end{document}$ , the probability that OTUs $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j$$\end{document}$ co-occurred together in some particle was computed as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${p}_{ij}^{s}={\sum }_{k}{\beta }_{k}^{s}I({\theta }_{ki}>\tau )I({\theta }_{kj}>\tau ).$$\end{document}$ The mean over $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$S=1000$$\end{document}$ posterior samples was then computed to obtain the final probability of co-association, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${p}_{ij}=\frac{1}{S}{\sum }_{s}{p}_{ij}^{s}$$\end{document}$ .

The area under the receiver operator curve (AUC ROC) was then computed for all methods, using adjusted p values for Fisher’s exact test and SIM9, and using posterior co-occurrence probabilities for MCSPACE.

Assemblage recovery error

The number and/or ordering of assemblages inferred by models will not in general match that of the ground truth assemblages. To account for this and provide a means to objectively compare inferred assemblages against the ground truth, we developed an algorithm that computes a metric that we term the assemblage recovery error (Fig. S5). Specifically, we computed this metric as follows:

Find all possible distances between inferred and true assemblages. Specifically, we compute a matrix $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${D}_{ij}$$\end{document}$ of all pairwise distances between inferred and ground truth assemblages, using the Hellinger distance metric. The Hellinger distance is often used to quantify the distance between two probability distributions over discrete variables and is given as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$H\left({\boldsymbol{a}},{\boldsymbol{b}}\right)=\frac{1}{\sqrt{2}}\sqrt{{\sum }_{i}{\left(\sqrt{{a}_{i}}-\sqrt{{b}_{i}}\right)}^{2}}$$\end{document}$ , for distributions $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{a}}$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{b}}$$\end{document}$ .
Initialize an accumulating total error. Set the total error as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d=0$$\end{document}$ .
Match inferred and true assemblages. While the number of rows and columns of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${D}_{ij}$$\end{document}$ is both greater than 1:

Compute the minimum element of the distance matrix $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$m =\underset{\mathrm{i,j}}{\min}{D}_{ij}$$\end{document}$ and add to the accumulated error $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d=d+m$$\end{document}$ .
Remove the row and column of the distance matrix $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$D$$\end{document}$ corresponding to the minimum element $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$m$$\end{document}$ .

Add penalties for unmatched assemblages. For all remaining unmatched assemblages, add a penalty value of 1.0 to the total $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d$$\end{document}$ .
Normalize the final error. Normalize the error as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d=\frac{d}{{N}_{\max}},$$\end{document}$ where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${N}_{\max}= {\max}({N}_{l}, {N}_{g})$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${N}_{l}$$\end{document}$ are the number of learned assemblages and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${N}_{g}$$\end{document}$ is the number of ground truth assemblages.

We compute this metric over $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$S=100$$\end{document}$ posterior samples from the MCSPACE ground-truth model. For each posterior sample $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s$$\end{document}$ , we only keep assemblages with $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\gamma }_{k}^{s}=1$$\end{document}$ , that is, those that are selected to be present in that posterior sample. The final error is summarized as the median over assemblage errors computed on each posterior sample.

For the Gaussian mixture models, inferred assemblage distributions were obtained by applying an inverse-ILR transformation on inferred cluster means to convert them back to relative abundances.

Cross-validation analysis with real data

For each data sample, particle data were split randomly into five test/train datasets, holding out 20% of particles in each fold. For each held-out particle, reads were downsampled by sampling 50% of the total reads from the particle. Each model was then tasked with predicting the held-out reads for each downsampled particle.

For each cross-validated fold, MCSPACE was run with a capacity of K = 100 assemblages. The GMM was run for a range of 2 to 100 assemblages, and the best model was selected using the AIC, as described above. As noted above, the directional GMMs failed to converge above 10 assemblages and were therefore only run up to 10 assemblages.

For GMMs, an assemblage assignment was obtained for each downsampled particle using maximum likelihood. Predicted reads $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{{\boldsymbol{r}}}$$\end{document}$ were then obtained from the assigned assemblage by first applying an inverse-ILR transformation on the inferred cluster to get relative abundances and then multiplying the relative abundances by the number of reads of the original particle.

Each held-out particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{r}}$$\end{document}$ was then compared to the predicted reads $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\widehat{{\boldsymbol{r}}}$$\end{document}$ for that particle using the cosine distance: $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d=1-\frac{{\boldsymbol{r}}\cdot \widehat{{\boldsymbol{r}}}}{\left|{\boldsymbol{r}}\right||\widehat{{\boldsymbol{r}}}|}$$\end{document}$ .

For MCSPACE, for each held out particle $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\boldsymbol{r}}$$\end{document}$ , 1000 posterior samples $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$s$$\end{document}$ were obtained for predicted reads $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\widehat{{\boldsymbol{r}}}}^{s}$$\end{document}$ , and the cosine distance was taken for each posterior sample, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${d}^{s}=1-\frac{{\boldsymbol{r}}\cdot \widehat{{{\boldsymbol{r}}}^{{\boldsymbol{s}}}}}{\left|{\boldsymbol{r}}\right||{\widehat{{\boldsymbol{r}}}}^{s}|}$$\end{document}$ . The median was then taken over posterior samples of the cosine distance $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${d}^{s}$$\end{document}$ , for each particle.

Analysis of human and mouse datasets

Association analysis

To visualize changes between taxa associations over time and across conditions, we computed pair-wise association scores. We define the association score as the probability of generating a pair of reads of both OTUs i and j from the same assemblage, normalized by the marginal probabilities of generating two reads containing the OTUs: $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\alpha }_{ij}\left(t\right)=\frac{{\sum }_{k}{\beta }_{kt}{\theta }_{ki}{\theta }_{kj}}{0.5({\sum }_{k}{\beta }_{kt}{\theta }_{ki}+{\sum }_{k}{\beta }_{kt}{\theta }_{kj})}$$\end{document}$ .

Quantification and statistical analysis

Model comparisons

For comparing model performance, statistical significance was assessed with the Wilcoxon rank sum test followed by Benjamini–Hochberg correction for multiple hypothesis testing. On semi-synthetic data, for each condition, each model was run on (n = 10) simulated data replicates. For cross-validated performance on real data, distributions of cosine distance were compared for all particles remaining after quality filtering (n = 7055 in human data and n = 56,848 in mouse data).

Supplementary Information

Additional file 1: Table S1. Composition of diets used in mouse study, in g/Kg. Fig. S1: SAMPL-seq data pre-processing filtering curves. Thresholds for minimum reads per particle and OTU abundance across particles were varied, with the number of retained particles and OTUs plotted against each threshold. (A-C) Number of OTUs remaining versus minimum relative abundance thresholds for human (A), human with inulin perturbation (B), and mouse (C) datasets. (D-F) Number of particles remaining versus threshold for minimum number of reads per particle for human (D), human with inulin perturbation (E), and mouse (F) datasets. Red circles and corresponding red value on x-axis correspond to estimates of “elbows” (0.005 minimum abundance and 250 minimum reads per particle), which were then used for final data filtering. Fig. S2. Summary of longitudinal human SAMPL-seq spatial co-localization dataset. Visualization of filtered data from a longitudinal study of gut microbiome spatial co-localization in a healthy human participant (collected daily for five days, n = 5 fecal samples total). (A) Phylogenetic tree of Operational Taxonomic Units (OTUs) present in particles. (B) Clustered heatmap of particles, showing the relative abundance of taxa in particles over each of the five consecutive days. (C) Density plots of OTUs per particle. Fig. S3. Summary of longitudinal human SAMPL-seq spatial co-localization dataset with inulin supplementation. Visualization of filtered data from a longitudinal study of gut microbiome spatial co-localization in a healthy human participant with inulin supplementation perturbation (collected over 12 days, n = 7 fecal samples total). (A) Phylogenetic tree of Operational Taxonomic Units (OTUs) present in particles. (B) Clustered heatmap of particles, showing the relative abundance of taxa in particles in each sample. (C) Density plots of OTUs per particle. Fig. S4. Summary of new murine longitudinal SAMPL-seq spatial co-localization dataset with multiple dietary perturbations. Visualization of filtered data from a longitudinal study of gut microbiome spatial co-localization in 3 mice subjected to defined dietary perturbations (n = 21 total fecal samples). HF = high fat; HFHF = high fat, high fiber; LP = low protein; S1-4 = standard diet 1–4. (A) Phylogenetic tree of OTUs present in particles. (B) Hierarchically clustered heatmap of particles, pooled over the three mice, showing the relative abundance of taxa in particles in each of the diets. (C) Density plots of OTUs per particle distributions, pooled over the three mice, for each diet*.* Fig. S5. Schematic of assemblage recovery error metric calculation. Our metric accounts for differences in both the number and order of recovered assemblages, which may differ from that of ground truth. To compute the metric, a greedy algorithm matches inferred and true assemblages by iteratively pairing the closest matches based on pairwise distances, then accumulating errors for each pair. Unmatched assemblages incur a penalty, and the total error is normalized by the larger of the inferred or true assemblage count. Full algorithmic details are provided in the Methods section. Fig. S6. MCSPACE performance is robust to hyperparameter settings. MCSPACE was evaluated using default settings and with hyperparameters for the variance on the prior for process variance, perturbation magnitude, and contamination weight increased by 10X and 100X. Benchmarking metrics were consistent across all settings, indicating that model performance is robust to the chosen hyperparameters. Semi-synthetic data was simulated from the MCSPACE model inferred on real data. Methods were assessed for their ability to recover ground-truth information in three tasks: (Top row) Detecting co-associated pairs of microbes, assessed with area under the receiver operator curve (AUC) (higher values indicate superior performance), (Middle row) Recovering the correct frequencies of OTUs in assemblages, assessed using assemblage recovery error (lower values indicate superior performance), and (Bottom row) Inferring the correct number of assemblages, assessed by subtracting the inferred value from the correct value (values nearer to zero indicate superior performance). Boxplots show results from 10 simulated replicates. Central lines indicate medians with boxes representing interquartile ranges (IQRs). Whiskers extend to data points within 1.5 × IQR. Statistical significance was assessed with a Wilcoxon rank sum test followed by Benjamini–Hochberg correction for multiple hypothesis testing*.* Fig. S7. MCSPACE identified spatial assemblages among taxa and temporal changes in assemblage proportions in the human gut microbiome from a longitudinal SAMPL-seq dataset. MCSPACE identified 58 OTUs assorting into 40 spatial assemblages, with assemblage abundances tracked over time. A phylogenetic tree of OTUs present in the dataset is displayed on the left. Heatmaps show assemblage proportions over the five consecutive days in the study (above), and OTU frequencies in inferred spatial assemblages (below) with OTU assemblage frequencies ≥ 0.05 shown. Fig*.* S8. MCSPACE identified spatial assemblages and their temporal changes in SAMPL-seq human gut microbiome dataset with inulin perturbation. MCSPACE identified 50 OTUs assorting into 42 spatial assemblages, with assemblage abundances tracked over time. A phylogenetic tree of OTUs present in the dataset is displayed on the left. Heatmaps show assemblage proportions over each day in the study (above), and OTU frequencies in inferred spatial assemblages (below) with OTU assemblage frequencies $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ge$$\end{document}$ 0.05 shown. Fig. S9. MCSPACE analysis reveals inulin supplement-induced changes in spatial associations of key taxa in human study. Spatial association mapping of Bifidobacterium OTU6 and Agathobacter rectalis OTU1. Maps depict associations on pre-inulin, inulin, and post-inulin supplementation dietary intervals. Node sizes indicate relative abundances of taxa and edge widths indicate the strengths of spatial associations (< 0.01 not shown). Relative abundances and spatial association scores were averaged over days in each dietary period (days 1,2,4 for pre-inulin; 7,8,10 for inulin; 11 for post-inulin). During inulin supplementation, Bifidobacterium OTU6 gained or strengthened associations with many taxa (e.g. A. rectalis OTU1, R. faecis/torques OTU17, Blautia OTU8, Lachnospiraceae OTU82, and F. saccharivorans OTU9), many of which also increased in abundance during inulin. Several associations persisted post-inulin. A. rectalis OTU1 also gained associations during and after inulin supplementation, including a transient increase in its association with Bifidobacterium OTU6. Fig. S10. MCSPACE identified spatial assemblages among taxa and changes in assemblage proportions in the murine gut microbiome from a SAMPL-seq dataset investigating multiple dietary perturbations. MCSPACE identified 74 OTUs assorting into 63 spatial assemblages, with assemblage abundances tracked over time for three biological replicates. A phylogenetic tree of OTUs present in the dataset is shown on the left. Heatmaps show assemblage proportions (above) inferred for each biological replicate over the seven dietary intervals (HF = high fat; HFHF = high fat, high fiber; LP = low protein; S1-4 = standard diet 1–4), and OTU frequencies in inferred spatial assemblages (below). OTUs with assemblage frequencies ≥ 0.05 are shown. Fig. S11*.* MCSPACE analysis of the murine data revealed diet-induced spatial shifts in the gut microbiome. Assemblages with strong evidence of an effect from one or more perturbations (Bayes Factors > 10) are shown for each dietary perturbation. HF = high fat; HFHF = high fat, high fiber; LP = low protein; S1-3 = standard diet 1–3. Phylogenetic tree of OTUs present with frequency > 5% in any of the perturbed assemblages is shown on the left. Heatmaps show assemblage proportions before and after perturbation for each dietary perturbation (above) and OTU frequencies in perturbed spatial assemblages (below).

Bibliography5

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Pédron T, Mulet C, Dauga C, Frangeul L, Chervaux C, Grompone G, Sansonetti PJ. A crypt-specific core microbiota resides in the mouse colon. M Bio 2012;3. 10.1128/mbio.00116-00112.10.1128/m Bio.00116-12PMC 337296522617141 · doi ↗ · pubmed ↗
2Richardson M, Zhao S, Sheth RU, Lin L, Qu Y, Lee J, Moody T, Ricaurte D, Huang Y, Velez-Cortes F. SAMPL-seq reveals micron-scale spatial hubs in the human gut microbiome. Nat Microbiol. 2024. 10.1038/s 41564-024-01914-4.PMC 1212079939901058 · doi ↗ · pubmed ↗
3Ari Sarfatis et al. , Highly multiplexed spatial transcriptomics in bacteria. Science 387, eadr 0932(2025). 10.1126/science.adr 0932.10.1126/science.adr 0932 PMC 1227806739847624 · doi ↗ · pubmed ↗
4Urtecho G, Moody T, Huang Y, Sheth RU, Richardson M, Descamps HC, Kaufman A, Lekan O, Zhang Z, Velez-Cortes F, Qu Y. Spatiotemporal dynamics during niche remodeling by super-colonizing microbiota in the mammalian gut, Cell Systems. 2024:15(11);1002-17.e 4 ISSN 2405-4712. 10.1016/j.cels.2024.10.007.10.1016/j.cels.2024.10.007PMC 1206617339541983 · doi ↗ · pubmed ↗
5Gotelli N, Hart E, Ellison A, Hart ME. Package ‘Eco Sim R’. R Package. 2015. https://github.com/Gotelli Lab/Eco Sim R.