BrainConnect: processing brain connectivity and spatial transcriptomics data for integrative analysis

Chenglong Sang; Cheng Peng

PMC · DOI: 10.1093/bioinformatics/btag120 · March 10, 2026

Open Access

TL;DR

This paper introduces BrainConnect, a software tool that integrates brain connectivity and spatial transcriptomics data to predict and understand neural connections and their molecular basis.

Contribution

The novel contribution is a software framework that processes brain connectivity and spatial transcriptomics data together to predict connectivity strengths using machine learning.

Findings

01. The model accurately predicted connectivity strengths based on spatial transcriptomics data.

02. The software helps identify important genes potentially involved in regulating brain connectivity.

03. BrainConnect provides a consistent data format for integrative analysis of brain datasets.

Abstract

Characterizing neuronal connectomes provides a route to understanding the basis of neural circuits in the brain, one of the central missions in neuroscience, but the mapped connectivity lacks molecular information, which obscures the understanding of the important genes underlying the connectomes. Whole-brain spatial transcriptomics data provide an opportunity to predict and understand brain connectivity. However, there is no method to process these datasets in a consistent data format for integrative analysis. In this work, we developed software to process different kinds of mouse brain connectivity data together with spatial transcriptomics in consistent brain regions to define the connectivity path and strength, and then used a long short-term memory network to predict connectivity strengths from the spatial transcriptomics using our data framework. We evaluated the model…

Funding

National Natural Science Foundation of China (funder ID: 10.13039/501100001809)

Taxonomy

Topics: Single-cell and spatial transcriptomics · Functional Brain Connectivity Studies · Bioinformatics and Genomic Networks

Full text

1 Introduction

The mammalian brain is composed of a large number of interconnected neurons that form complex connectivity, which plays an essential role in regulating the transmission of information between neurons and supports functions such as learning, cognition and decision-making (Luo 2021). Many neurological disorders, such as autism and schizophrenia, are associated with abnormal brain connectivity (Blaylock and Faria 2021, Du et al. 2021, Jutla et al. 2022). Thus, deciphering brain connectivity and the corresponding functions encoded in the brain is one of the central missions in neuroscience (French and Pavlidis 2011, Ji et al. 2014, Sun et al. 2023). Connectivity mapping technologies, including high-resolution imaging and viral tracing methods, provide efficient tools to map brain connectivity at different scales, ranging from synaptic-level connectivity to long-range projectomes. Imaging technologies, such as fluorescence Micro-Optical Sectioning Tomography (fMOST) (Li et al. 2010, Gao et al. 2022, Liu et al. 2025), provide single-neuron connectomes, while trans-synaptic viral tracing methods provide whole-brain projectomes that show how individual neurons extend their axons and transmit information across distant brain regions (Chamberlin et al. 1998, Lo and Anderson 2011, Rivera et al. 2025).

Although imaging connectomes and viral-tracing projectomes have revealed principles of brain connectivity, these two kinds of approaches have mainly been applied separately at different levels without explicit integration. In addition, both connectomes and projectomes lack molecular information on neurons, which hinders the exploration of the molecular mechanisms underlying brain connectivity. Importantly, these experimental reconstructions are costly (Kebschull et al. 2016, Chen et al. 2019, Sun et al. 2021, Chen et al. 2022), which has motivated efforts to develop computational tools that predict brain connectivity in silico. Since a correlation between regional gene expression and connectivity has been established in mice and humans (Ji et al. 2014, Sun et al. 2023), the rapid development of spatial transcriptomics provides an opportunity to predict brain connectivity from gene expression and to identify the gene expression signatures underlying brain connectivity using publicly available brain spatial transcriptomics data (Hong et al. 2023, Yang et al. 2024, Han et al. 2025).

Growing evidence suggests that the formation and maintenance of neuronal connections depend not only on the properties of the start and end regions but also on sustained molecular support from the intermediate regions through which axons pass (Yu and Bargmann 2001, Dickson 2002, Bashaw and Klein 2010, Dent et al. 2011, Seiradake et al. 2016, Stoeckli 2018). During development, these intermediate regions can guide axon growth through the expression of guidance molecules (Yu and Bargmann 2001, Bashaw and Klein 2010, Stoeckli 2018). Importantly, even after the axons reach their final target regions, these intermediate regions continue to express a range of molecules that help maintain axonal structural stability, prevent aberrant retraction, and participate in branch pruning and the regulation of synaptic plasticity, a key process termed axonal homeostasis (Yu and Bargmann 2001, Stoeckli 2018). These findings indicate that the spatial transcriptomes of the adult brain not only reflect the history of neural development but also encode the molecular information essential for the homeostasis of neuronal projections, which implies that it is helpful to include these intermediate brain regions in connectivity prediction and exploration.

However, single-neuron images, viral tracing connectivity and spatial transcriptomics come in different data formats, and there is no computational tool to process these data in a consistent format for integrative analysis. In addition, no method has been developed to integrate the intermediate projection paths into connectivity prediction. Only Sun et al. (2023) used a random forest to determine the potential relationship between gene expression and brain connectivity, but this method has limitations. First, the model was primarily suited to binary prediction of connectivity (presence versus absence), whereas brain connectivity is actually a continuous strength among brain regions. Second, the model only utilized viral-tracing connectivity and neglected the connectivity information in single-neuron images. We therefore developed software to process single-neuron and viral tracing connectivity data in a consistent format using the same mouse brain coordinate system, and we also proposed a model to predict continuous connectivity strength from spatial transcriptomics, with the ability to select potentially important genes in connectivity prediction. Specifically, we used the Allen mouse brain common coordinate framework (CCF) to process and integrate single-neuron and viral-tracing connectivity data to build connectivity paths and the corresponding strengths. We then used a long short-term memory (LSTM) network to predict connectivity strength from spatial transcriptomics while accounting for the impact of the connectivity path, and used a gradient method to measure the importance of gene expression in the prediction. The results showed that our method captured the connectivity paths from single neurons well and accurately predicted the connectivity strengths between start and end brain regions. The method can also output important genes with biological implications for brain connectivity, which helps in exploring the molecular basis of reconstructed connectivity maps.

2 Materials and methods

2.1 Single-neuron data processing

A total of 23 637 single-neuron imaging datasets of the prefrontal cortex, hippocampus and hypothalamus, generated using two-photon fMOST, were downloaded from the Digital Brain platform. The raw SWC files acquired via two-photon fMOST imaging contain discrete point data representing neuronal structures. In Digital Brain, each neuronal point is described by a 7-column record comprising the point identification number, point type classification, 3D spatial coordinates $(x, y, z)$, morphological radius, and parent node. To ensure data quality and consistency, we first preprocessed the original SWC files using the pyswcloader Python toolkit provided by Digital Brain (Cannon et al. 1998, Wang and Jiao 2025). We corrected the orientation of the spatial coordinate axes and constrained all 3D coordinates to the standard brain space dimensions to eliminate bias caused by inconsistent coordinate systems. We then used the Allen Mouse Brain Common Coordinate Framework (CCF) v3 to determine the brain region to which each node belonged and added the brain region identifier as a new attribute column in the data matrix, thus integrating neuronal morphology with anatomical position. To construct the projection paths from the soma to the neuronal terminals, we used a directed graph model to describe and traverse all possible projection paths of each neuron. Specifically, we first converted the SWC data of each neuron into a directed acyclic graph, in which each SWC node was initialized as a graph node with attributes including its 3D spatial coordinates $(x, y, z)$ and brain region. Directed edges were established from parent nodes to child nodes according to the hierarchical relationships in the SWC file, where the root node was identified as the special node with a parent ID of −1.
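The conversion from SWC records to a parent-to-child graph can be sketched as follows. This is a minimal illustration in plain Python, not the BrainConnect or pyswcloader implementation; the function names and the `region_of` lookup (standing in for a query of the CCF annotation volume) are hypothetical.

```python
from collections import defaultdict


def parse_swc(text):
    """Parse SWC text into {node_id: record} using the standard 7 columns:
    id, type, x, y, z, radius, parent."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and header comments
            continue
        nid, ntype, x, y, z, radius, parent = line.split()[:7]
        nodes[int(nid)] = {
            "type": int(ntype),
            "xyz": (float(x), float(y), float(z)),
            "radius": float(radius),
            "parent": int(parent),
        }
    return nodes


def build_graph(nodes):
    """Return (roots, children): directed edges run from parent to child,
    and a node whose parent ID is -1 is treated as a root (the soma)."""
    children = defaultdict(list)
    roots = []
    for nid, rec in nodes.items():
        if rec["parent"] == -1:
            roots.append(nid)
        else:
            children[rec["parent"]].append(nid)
    return roots, dict(children)


def annotate_regions(nodes, region_of):
    """Attach a brain-region label to every node.
    `region_of` is a hypothetical callable mapping (x, y, z) to a region ID."""
    for rec in nodes.values():
        rec["region"] = region_of(rec["xyz"])
    return nodes
```

With a toy file of four points (one soma, one branch point, two terminals), `build_graph` returns a single root and a `children` map that encodes the branching.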

Path extraction combined topological sorting and dynamic programming. First, the neuronal directed graph was topologically sorted to ensure that nodes were processed in the correct hierarchical order from the soma to the terminals. Subsequently, we used dynamic programming to accumulate all possible projection paths, processing each node in topological order. At each branching point (a parent node with multiple child nodes), a new branch was generated for each child by replicating the path from the root to the parent and then appending the child node. This extension of paths from parent nodes to child nodes continued until all nodes were processed, ultimately yielding the complete path set for each single neuron.
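A minimal sketch of this path accumulation is given below, assuming the neuron is already available as a list of root IDs and a `children` adjacency map (hypothetical names). For a tree, breadth-first order from the root is a valid topological order. The final helper, which collapses a node path into the sequence of brain regions it passes through, is an assumption about how node paths become region paths.

```python
from collections import deque


def topological_order(roots, children):
    """Breadth-first order from the roots; parents always precede children."""
    order, queue = [], deque(roots)
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    return order


def root_to_terminal_paths(roots, children):
    """Dynamic programming over the topological order:
    path(child) = path(parent) + [child]. Paths ending at leaves are returned."""
    path_to = {r: [r] for r in roots}
    for node in topological_order(roots, children):
        for child in children.get(node, []):
            # replicate the root-to-parent path and append the child
            path_to[child] = path_to[node] + [child]
    return [path for node, path in path_to.items() if not children.get(node)]


def region_path(node_path, region_of_node):
    """Collapse a node path into the ordered brain regions it traverses,
    merging consecutive nodes that lie in the same region."""
    regions = []
    for node in node_path:
        region = region_of_node[node]
        if not regions or regions[-1] != region:
            regions.append(region)
    return regions
```

Copying the parent path at every node costs memory proportional to the number of nodes times the tree depth; storing only parent pointers and rebuilding paths at the leaves would be a leaner alternative.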

Projection paths are highly heterogeneous among neurons. Even for neurons with the same start and end brain regions, the projection paths exhibit remarkable diversity, with some neurons passing through different intermediate brain regions. In particular, we observed abundant inter-regional oscillating trajectories in which neurons traverse repeatedly between two brain regions before reaching their target brain regions, e.g. Region A → Region B → Region A → Region B (Fig. 1, available as supplementary data at Bioinformatics online). To simplify the problem, we used a greedy algorithm to extract the averaged major connectivity path from the heterogeneous individual paths based on edge frequencies. Specifically, we grouped paths sharing the same start and end brain regions into one dataset. For each dataset, we computed the frequencies of its directed edges and sorted the edges in descending order of frequency. We then iteratively built a directed graph by appending edges one at a time in that order. At each iteration, we checked whether a connected path existed between the start and end brain regions. The process terminated at the first successful connection, and we then extracted the shortest path and its length, thereby identifying the minimal set of high-frequency edges required to form the major pathway. brainrender (Claudi et al. 2021) was used to visualize the neurons and brain regions in this work.
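The greedy construction described above can be sketched as follows. This is an illustrative reading of the procedure, not the released code: the function names are hypothetical, ties between equally frequent edges are broken by first occurrence, and the shortest path is found by breadth-first search.

```python
from collections import Counter, deque


def shortest_path(adj, start, end):
    """Breadth-first shortest path in a directed graph; None if unreachable."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


def averaged_major_path(region_paths):
    """Greedy major path for region paths sharing one start and one end region.

    Edges are added in descending frequency until start and end become
    connected; the shortest path in that partial graph is returned."""
    start, end = region_paths[0][0], region_paths[0][-1]
    edge_counts = Counter(
        (a, b) for path in region_paths for a, b in zip(path, path[1:])
    )
    adj = {}
    for (a, b), _count in edge_counts.most_common():
        adj.setdefault(a, []).append(b)
        path = shortest_path(adj, start, end)
        if path is not None:
            return path
    return None
```

For the toy group `[["A","B","C"], ["A","B","C"], ["A","D","C"], ["A","B","A","B","C"]]`, the edge A→B is the most frequent, followed by B→C, so the procedure stops after two edges and returns `["A", "B", "C"]`. The oscillating trajectory in the last path contributes to the edge counts but does not appear in the result.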

To further evaluate the reliability of the averaged major paths, we computed, for each single neuron, the proportion of the neuron that traverses the brain regions defined by its averaged major path. For a given neuron image, we calculated the proportion of pixels located within the brain regions of its major path. Overall, ∼70% of single-neuron trajectories were spatially covered by the averaged major paths, and around 32% of single neurons exhibited coverage >0.9 (Fig. 2, available as supplementary data at Bioinformatics online). This analysis indicates that the averaged major paths reflect the core characteristics of the connectivity paths shared by heterogeneous neurons.
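The coverage statistic can be written compactly. The sketch below assumes each neuron is given as a list of region labels, one per reconstructed point, which stands in for the pixel-level count used in the paper; the function names are hypothetical.

```python
def path_coverage(point_regions, major_path):
    """Fraction of a neuron's points that fall in regions of its major path."""
    if not point_regions:
        return 0.0
    allowed = set(major_path)
    return sum(region in allowed for region in point_regions) / len(point_regions)


def share_above(coverages, threshold=0.9):
    """Fraction of neurons whose coverage exceeds the threshold."""
    if not coverages:
        return 0.0
    return sum(c > threshold for c in coverages) / len(coverages)
```

For example, a neuron with points in `["CA1", "CA1", "DG", "AON"]` and a major path `["CA1", "AON"]` has a coverage of 0.75.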

2.2 Fluorescence data processing

The projection signals registered to the CCF brain regions were downloaded from the Allen Mouse Connectivity Atlas using the allensdk package (Wang et al. 2020). This dataset is a high-resolution 3D map of neuronal connections in which fluorescent proteins delivered by recombinant adeno-associated virus or adeno-associated virus (rAAV/AAV) label neuronal cell bodies and axons, so that connection strength can be quantified through fluorescence intensity (Kuan et al. 2015). Overall, 242 cortical gray matter structures, 330 subcortical gray matter structures, 82 fiber bundles, 8 ventricles and their associated structures were annotated in 3D space. Each rAAV/AAV projection image was re-sampled and projected into the canonical space and then divided into 10 × 10 µm grids. The pixel number and intensity were recorded in each grid cell, and the sums of the total pixel number and pixel intensity were calculated for the area manually marked as the injection site. The resulting 3D mesh was transformed to the standard reference space using linear interpolation to generate sub-mesh values, and the grid values were then categorized as projection density and injection fraction. The 25-micron resolution template of the CCF brain regions was represented by a 3D matrix $M \in \mathbb{R}^{X \times Y \times Z}$, where the x, y, and z axes correspond to the anterior-posterior, superior-inferior and left-right axes, respectively. In each matrix entry, i.e. each $25\,\mu\mathrm{m}$ cubic voxel, the projection density was defined as the ratio of the number of detected fluorescent pixels to the total number of pixels within the voxel. The mean projection density, computed after excluding voxels that were not significantly different from the background signal, was used as the connectivity strength of the selected brain region. In each experiment, we obtained the injection position by determining the centroid of the injection area based on the injection fraction and quantified the projection intensity of each area by its mean projection density. Note that, when preprocessing the projection intensity data, we excluded values below the 75th percentile of the population for each injection site and retained only the regions with high signal strength to improve the signal-to-noise ratio.
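The last two steps, averaging voxel-level projection density per region and keeping only the upper quartile for each injection site, can be sketched in plain Python. This is an assumption-laden illustration: the background-significance filter is omitted, the function names are hypothetical, and the "inclusive" quantile method is one of several possible percentile definitions.

```python
from statistics import mean, quantiles


def region_strengths(voxel_density, voxel_region):
    """Mean projection density per brain region for one injection experiment."""
    by_region = {}
    for density, region in zip(voxel_density, voxel_region):
        by_region.setdefault(region, []).append(density)
    return {region: mean(values) for region, values in by_region.items()}


def keep_upper_quartile(strengths):
    """Keep regions whose strength is at or above the 75th percentile."""
    q3 = quantiles(strengths.values(), n=4, method="inclusive")[2]
    return {region: s for region, s in strengths.items() if s >= q3}
```

In practice the voxel arrays would come from the 25-micron projection density volume and the CCF annotation volume of the same shape, flattened to one entry per voxel.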

2.3 Spatial transcriptomics data

We downloaded the Stereo-seq spatial transcriptomics dataset of whole mouse brain coronal sections from Digital Brain, which includes 195 slices covering many mouse brain regions. The raw data were stored as text files containing gene names, UMI counts, and the corresponding brain region identifiers (Han et al. 2025). To integrate the data, we aggregated the gene expression of the same anatomical regions and then mapped it to the CCF brain regions. In this process, some sub-regions had no data because of differences in anatomical precision or inconsistent labeling systems. To preserve data integrity and maintain spatial continuity, the gene expression of each missing sub-region was replaced with the expression level of its parent brain region. This yielded the gene expression matrices of the mouse brain regions. Because many genes vary only weakly between brain regions and contribute little to the predictive model, we retained only the top 50% of genes with the highest variation across all brain regions. To enhance biological interpretability, we further removed genes with no clearly annotated function (e.g. 0610005C13Rik, Gm10010). After these filtering steps, a total of 5627 highly variable genes were retained. Subsequently, we calculated Z-scores across brain regions to mitigate region-specific detection biases and then calculated Z-scores across genes to reduce the biases in the remaining gene expression values. Principal Component Analysis (PCA) was then used to reduce the dimension of the gene expression data, and the first 64 principal components were used in the subsequent connectivity strength prediction to balance prediction accuracy and run time.
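The variance filter and the two-step Z-scoring can be sketched as follows, assuming the expression matrix is a list of rows (one per brain region) with one column per gene. The function names are hypothetical, population variance is used as the variation measure (the paper does not specify the statistic), and the annotation filter and the PCA step are left out.

```python
from statistics import mean, pstdev, pvariance


def top_variable_genes(matrix, genes, fraction=0.5):
    """Keep the given fraction of genes with the highest variance across regions."""
    columns = list(zip(*matrix))
    ranked = sorted(
        range(len(genes)), key=lambda j: pvariance(columns[j]), reverse=True
    )
    keep = sorted(ranked[: max(1, int(len(genes) * fraction))])
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [genes[j] for j in keep]


def zscore(values):
    """Standardize a sequence; constant sequences map to zeros."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def double_zscore(matrix):
    """Z-score each gene across regions, then each region across genes."""
    gene_scaled = [zscore(column) for column in zip(*matrix)]  # per gene
    rows = [list(row) for row in zip(*gene_scaled)]  # back to regions x genes
    return [zscore(row) for row in rows]  # per region
```

The order matters: standardizing per gene first removes differences in overall expression scale between genes, and the second pass then removes region-level offsets.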

2.4 Connectivity strength prediction model

The connectivity path was taken into account in the connectivity strength prediction in this work. The latent gene expression representations of the regions along the selected connectivity path were processed through a three-layer fully connected network with dimensions 64→256→128→64, whose 64-dimensional output was used in the subsequent calculations. The parameterized rectified linear unit (PReLU) was used as the activation function in each hidden layer. For the root node $v_0$ of the selected connectivity path, the 64D representation was concatenated with the connectivity strength of the root node for the matched brain region. The concatenated vector was fed to two separate one-layer networks to generate the initial hidden state 0 and the initial cell state 0. The initial hidden state, the initial cell state and the 64D representation of node $v_1$ were then used together as the input to the LSTM network to produce hidden state 1 and cell state 1. The output hidden state 1 was used to predict the connectivity strength of node $v_1$ through a fully connected layer. For each remaining node in the connectivity path, the LSTM update and connectivity strength prediction were performed in the same way as for node $v_1$, with shared weights for the connectivity strength prediction network and PReLU as the activation function. For the same start and end brain regions, several averaged major paths may exist because of the complexity of neuronal projections; the connectivity strengths predicted along these paths were averaged to define the final connectivity strength from the start to the end brain region. To further enhance model flexibility, the output layer used a learnable scalar error correction coefficient k to adapt to the signal attenuation caused by the cross-regional spread of the rAAV/AAV, thereby correcting the systematic prediction bias. The Adam optimizer (with a learning rate of 0.001) was used to minimize the loss between the experimentally determined and predicted connectivity strengths.
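The forward pass described above can be summarized in equations. The notation is introduced here for illustration and is not taken from the paper: $x_t$ is the 64-dimensional PCA representation of the $t$-th region on a path of length $T$, $e_t$ its embedding, $s_0$ the connectivity strength at the root, and $P$ the set of averaged major paths between one start-end pair. The affine form of the two initialization networks and the multiplicative use of the correction coefficient $k$ are assumptions, since the text does not specify them.

```latex
\begin{aligned}
e_t &= f_\theta(x_t), \qquad
  f_\theta:\ \mathbb{R}^{64} \to \mathbb{R}^{256} \to \mathbb{R}^{128} \to \mathbb{R}^{64}
  \quad \text{(PReLU activations)} \\
h_0 &= W_h\,[\,e_0;\ s_0\,] + b_h, \qquad
c_0 = W_c\,[\,e_0;\ s_0\,] + b_c \\
(h_t,\ c_t) &= \mathrm{LSTM}\bigl(e_t,\ h_{t-1},\ c_{t-1}\bigr), \qquad t = 1, \dots, T \\
\hat{s}_t &= k\,\bigl(w^{\top} h_t + b\bigr) \\
\hat{s}_{\mathrm{start}\to\mathrm{end}} &= \frac{1}{|P|} \sum_{p \in P} \hat{s}^{(p)}_{T_p}
\end{aligned}
```

The readout parameters $w$ and $b$ are shared across all positions $t$, which matches the shared-weight prediction network in the text. Each path contributes the prediction at its final node, and paths with the same endpoints are averaged.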

2.5 Gene importance analysis

We quantitatively assessed the importance of gene expression in predicting connectivity strength through gradient backpropagation. Using the trained prediction model, we randomly selected multiple sample groups from the input dataset, used the GradientTape function in TensorFlow to watch the input embeddings, and performed forward inference to obtain the predicted values for specific path nodes (Rumelhart et al. 1986). We then computed the gradient matrix of the predictions with respect to the input gene embeddings and aggregated importance scores by taking the mean of the absolute gradient values across embedding dimensions, in order to identify the latent feature dimensions that most strongly influence the output. To enhance the stability and statistical reliability of the analysis, the results derived from multiple principal components were subsequently averaged. The importance of each gene was evaluated directly from the PCA loading matrix. The absolute weight of each gene in the key principal components served as the primary importance metric, reflecting the contribution of the gene to the variance captured by these components. For each key principal component, the contribution values from multiple samples were aggregated and averaged.
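One possible reading of this two-stage scoring is sketched below. It assumes the gradients of the prediction with respect to the principal-component inputs have already been computed (for example with TensorFlow's GradientTape) and are available as one list per sample. The function names, the choice of `top_k` key components, and the unweighted average of absolute loadings are illustrative assumptions rather than the authors' exact procedure.

```python
def pc_importance(gradients):
    """Mean absolute gradient per input dimension, averaged over samples."""
    n_samples = len(gradients)
    n_dims = len(gradients[0])
    return [
        sum(abs(sample[d]) for sample in gradients) / n_samples
        for d in range(n_dims)
    ]


def gene_importance(pc_scores, loadings, genes, top_k=2):
    """Rank genes by their mean absolute PCA loading on the key components.

    `loadings[d][j]` is the loading of gene j on principal component d;
    the key components are the top_k dimensions by gradient importance."""
    key = sorted(range(len(pc_scores)), key=lambda d: pc_scores[d], reverse=True)
    key = key[:top_k]
    scores = {
        gene: sum(abs(loadings[d][j]) for d in key) / len(key)
        for j, gene in enumerate(genes)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A variant that weights each component's loadings by its gradient importance would also be consistent with the description; the unweighted form is used here only because it is the simplest.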

3 Results

3.1 Software overview

BrainConnect consists of four modules: single-neuron data processing, connectivity path construction, rAAV/AAV fluorescence data processing, and connectivity strength prediction from spatial transcriptomics (Fig. 1). The single-neuron images derived from fMOST were stored in raw SWC files, which contain the 3D coordinates of neurons. This coordinate-based format has mainly been used for neuronal morphological analysis in existing studies, but it lacks direct cross-region projection information. To solve this problem, we parsed the SWC files to reconstruct the complete spatial morphology of neurons, traced their complete cross-region paths starting from the soma location (start brain region), and output structured data suitable for integrative analysis (Fig. 1A). Next, we integrated all single-neuron paths and statistically sorted the brain connectivity paths between the same start and end brain regions using the reference CCF brain regions. Owing to the complexity of projections, different paths exist between the same start and end brain regions. However, we found that the majority of neurons with the same start and end regions followed the same connectivity path. We therefore constructed the maximum-probability connectivity path, defined as the averaged major path in this work (Fig. 1B). For the rAAV/AAV fluorescence data provided by the Allen Brain Atlas, we established an efficient and convenient data acquisition interface by organizing the download links in our software, and developed a pipeline that integrates the fluorescence data and single-neuron images to generate connectivity paths and strengths using the reference CCF brain regions (Fig. 1C). Finally, we developed a framework to predict connectivity strength from spatial transcriptomics, and we also provided an interface to select the important genes that might be involved in establishing, maintaining or regulating brain connectivity (Fig. 1D). We describe the detailed implementations and evaluations in the following subsections.

Figure 1. Software modules. (A) Single neuron data processing. The software reconstructs and transforms single-neuron morphological data into path-based representations which are subsequently annotated using the Allen brain CCF to generate connectivity paths. (B) Connectivity path construction. By analyzing the paths with the same start and end brain regions, the software sorts the averaged major paths that capture the common features of the single-neuron path population. (C) Connectivity strength integration. The software processes rAAV/AAV fluorescence from the Allen Brain Atlas to generate connectivity strength and then combines the strength data with the averaged major paths derived from single neurons. (D) Connectivity strength prediction. The spatial transcriptomics data are processed into the regional gene expression matrix under the Allen brain CCF, and an LSTM is used to predict connectivity strength between start and end brain regions from spatial transcriptomics.

3.2 Connectivity path and strength construction

High-resolution imaging and viral tracing methods provide connectivity data at different scales (Kuan et al. 2015, Liu et al. 2025). Single-neuron images provide path information for each neuron, including the soma and axonal terminals, but no direct connectivity strengths. By contrast, viral tracing data provide connectivity strengths between injection sites and the remaining regions, but no connectivity path information. Currently, there is no method to combine these two kinds of data for connectivity analysis. In this work, we integrated the single-neuron images obtained through fMOST with brain projections labeled by rAAV/AAV fluorescence to construct mouse brain connectivity with both path and strength information (Fig. 2).

Figure 2. Connectivity path and strength integration. (A) Illustration of major connectivity path generation. The top subpanel presents single-neuron imaging data and Allen mouse CCF coordinates at 25-micron resolution. The bottom subpanel depicts the averaged major path from CA1 to AON sorted from single-neuron paths. (B) Integration of connectivity path and strength. The projection strength derived from rAAV/AAV fluorescence was integrated with the depicted path under the CCF coordinates. (C) Statistical analysis of the averaged major paths from start to end brain regions derived from all single-neuron paths.

Specifically, by tracing the morphology of each single-neuron image, we registered each neuron to the 25-µm CCF brain regions to generate the sequence of brain regions that the neuron passed through (Wang et al. 2020), i.e. the connectivity paths of the individual neuron (Fig. 1, available as supplementary data at Bioinformatics online). Since neurons with the same start and end brain regions can pass through different intermediate brain regions, we used the frequency-based greedy method to identify the averaged major path between the start and end brain regions. First, we grouped the neurons sharing identical start-end pairs. For each start-end pair, we calculated the occurrence frequency of all directed connectivity edges, sorted them in descending order, and then added the edges to a graph one at a time in order of decreasing frequency. The process terminated when a path from the start region to the end region first appeared in the graph, and this path was used to define the averaged major path for the selected start-end pair (Fig. 2A). Meanwhile, we calculated the connectivity strengths at 25-µm resolution using the brain rAAV/AAV fluorescence data and then registered these strengths to the same CCF brain regions (Fig. 2B). In this way, we combined the connectivity path and strength for the same start and end brain regions (Fig. 2C).

3.3 Connectivity prediction model

To account for the connectivity path information, we used a Long Short-Term Memory (LSTM) network to frame connectivity strength prediction along the defined path as a sequential learning problem (Fig. 3). The gating mechanism of the LSTM can effectively capture long-range dependencies by mitigating the vanishing gradient problem inherent in processing long sequences. First, the whole-brain spatial transcriptomics data were processed and mapped to the CCF brain regions to generate the gene expression profiles for the matched connectivity path. The leading principal components derived from gene expression were used as input to a three-layer fully connected network that generates a gene expression embedding for each brain region. We also investigated how varying the dimensions of this fully connected network affected fitting performance, and the results showed that our model was robust to different dimension architectures (Fig. 3, available as supplementary data at Bioinformatics online), although the current dimension choice achieved a slightly higher convergence rate. Next, we concatenated the embedding vector and the connectivity strength for the start region (the injection site). This concatenated vector was passed through two fully connected networks to initialize the cell state and hidden state of the LSTM. With these two initialized states, the model processed the gene expression embeddings of the sequential regions along the selected path and iteratively updated the cell state and hidden state using the LSTM. Finally, the model predicted the connectivity strength from the start region to each specific region based on the hidden state at that brain region.

Figure 3. The architecture of connectivity strength prediction. The spatial transcriptomics data are processed under the CCF coordinates, and the dimension-reduced gene expression values are input to a fully connected network. The initial gene expression embedding and connectivity strength are used to initialize the cell state and hidden state. These states are then sequentially updated based on the gene expression embeddings using the LSTM.

3.4 Model evaluation and comparison

We conducted a systematic evaluation of model performance in ipsilateral prediction. The Pearson correlation coefficient (PCC) between predicted and true connectivity strengths exceeded 0.729, demonstrating good fitting performance (Fig. 4, available as supplementary data at Bioinformatics online). To validate the generalizability of our model, we downloaded an additional single-neuron imaging dataset from Digital Brain (Wang and Jiao 2025) and constructed connectivity paths from it independently. The model maintained high predictive capability on the new dataset, with a PCC of 0.861 (Fig. 4A).
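For reference, the PCC used as the evaluation metric can be computed as in the sketch below. It is a plain-Python version of the standard formula with a hypothetical function name; it is not taken from the BrainConnect code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Because the PCC is invariant to positive affine rescaling of either variable, it measures how well the predicted strengths track the measured ones rather than whether they match in absolute scale.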

Model evaluation and comparison. (A) Model validation using an independent single-neuron imaging dataset. (B) Model comparison under five-fold cross-validation. The LSTM, GRU, and RNN models incorporated sequential path information and brain region features step by step, whereas the XGBoost, Random Forest, and Neural Network models used the concatenated gene expression embeddings of the sequential brain regions as input. The predictions use path lengths ranging from 2 to 12 in A and B. The paired t-test was used for the two-sample comparisons. *P < 0.05, **P < 0.01. (C) Important gene selection from the LSTM prediction model.

We next compared our model with the existing methods, Random Forest and a simple Neural Network, and also included XGBoost in the evaluation. For a fair comparison, we replaced the LSTM module with Random Forest, XGBoost, or the Neural Network in turn while keeping all other computational steps in the framework unchanged. Because the Recurrent Neural Network (RNN) and the Gated Recurrent Unit (GRU) can also be applied to sequence problems, we included them to show the impact of the connectivity path on connectivity strength prediction. All models used the same start regions (injection sites), end regions (target sites), and intermediate regions as input features (gene expressions). For Random Forest, XGBoost, and the Neural Network, we concatenated the gene expression embeddings of all nodes on the given path into a single input vector. Five-fold cross-validation was used in the comparison. The LSTM model achieved the highest prediction scores, followed by the GRU and the RNN, and all three recurrent models scored higher than XGBoost, Random Forest, and the Neural Network (Fig. 4B). Because the key distinction between the recurrent models (LSTM, GRU, RNN) and the other three methods is the explicit incorporation of the sequential paths between start and end regions, the performance improvement indicates that the path information reconstructed from single-neuron images can substantially enhance connectivity strength prediction. Because the LSTM performed better than the GRU and the RNN on the current dataset, we implemented the LSTM in our software. However, BrainConnect can be easily extended to use other methods in the prediction module.
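The comparison protocol can be sketched as a five-fold split followed by a paired t statistic on per-fold scores. This is an illustrative sketch, not the authors' evaluation code: the helper names `five_fold_indices` and `paired_t_statistic`, the shuffling seed, and the per-fold scores are assumptions, and the scores are made-up numbers rather than reported results.

```python
import math
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Return (train, test) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched per-fold scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two matched score lists of length >= 2")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean / math.sqrt(var / n), n - 1


splits = five_fold_indices(n_samples=23)

# Made-up per-fold PCC scores for two hypothetical models evaluated on the same folds.
sequential_scores = [0.74, 0.76, 0.73, 0.75, 0.77]
concatenated_scores = [0.66, 0.69, 0.64, 0.68, 0.67]
t_value, dof = paired_t_statistic(sequential_scores, concatenated_scores)
print(round(t_value, 2), dof)
```

Pairing the scores by fold removes fold-to-fold variation from the comparison, which is why every model must be evaluated on identical splits. Converting the statistic to a P-value requires the Student t distribution with `dof` degrees of freedom, which a statistics library can supply.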

Our prediction model also identified key genes whose expression contributed to connectivity strength prediction and which might be related to the regulation, establishment, or maintenance of brain connectivity. The top genes selected by our model are presented in Fig. 4C. For example, the transcription factor Fezf1 (Chen et al. 2005, Watanabe et al. 2009, Eckler et al. 2011, Chua et al. 2021), which is mainly expressed in the olfactory epithelium and hypothalamus of mice, ranked 21st in overall importance in our model. A previous study showed that axons failed to project normally into the olfactory bulb in the Fezf1-deficient mouse model, suggesting an important role for Fezf1 in establishing axonal projections of olfactory neurons (Watanabe et al. 2009). In addition, Fezf1 is involved in regulating axonal orientation and dendrite morphogenesis in pyramidal neurons, and Fezf1-deficient mice exhibited weakened cord-thalamic connectivity (Hirata et al. 2004). Although direct evidence that the other selected genes regulate brain connectivity is lacking because of limited studies in this field, their biological functions suggest that they may influence connectivity strength (Fig. 5, available as supplementary data at Bioinformatics online). For example, the gene Dbh, encoding dopamine β-hydroxylase, showed the highest contribution to prediction. The absence of this enzyme is reported to cause a serious defect in the synthesis of norepinephrine, which is widely involved in promoting synaptic plasticity and neurotransmitter release, key factors in regulating or maintaining brain connectivity (Hu et al. 2007, Carey and Regehr 2009, Liu et al. 2010). However, further work is needed to directly elucidate the roles of the selected genes in brain connectivity.

4 Discussion and conclusion

Characterizing brain connectivity is one of the central missions in neuroscience. Single-neuron imaging and viral tracing methods provide data resources for mapping brain connectivity at different scales, while spatial transcriptomics data provide gene expression information for brain regions. It is therefore important to integrate these different kinds of data effectively to elucidate the organizing principles and molecular information underlying the mapped connectivity. In addition, experimental mapping of brain connectivity remains costly, so connectivity prediction is helpful for exploring brain connectivity and the corresponding molecular information. Currently, there is no method or software that processes these datasets in a consistent data format and brain coordinate system, which hinders integrative analysis of brain connectivity.

In this work, we used the complementary information in single-neuron images and rAAV/AAV fluorescence to construct a connectivity map with both path and strength information in the mouse CCF brain regions. The whole-brain spatial transcriptomics data were also mapped to the same brain regions for further data integration. We then proposed a computational tool that uses the reconstructed connection paths to predict connectivity strength from spatial transcriptomics and to select the genes whose expression is important in the prediction. We also performed different kinds of evaluations of our prediction model, and the results showed that it predicted connectivity strengths with much higher accuracy than the existing methods. The genes selected by our model as important for connectivity prediction are implicated in regulating brain connectivity.

This work also has some limitations. We validated our method only on an independent single-neuron imaging dataset and did not directly validate its reproducibility on rAAV/AAV fluorescence and spatial transcriptomics data, because no additional whole-brain datasets are available for these two data sources. In addition, the available data sources were generated by different groups and are not exactly matched in mouse strain, age, sex, and other conditions. These data limitations make it difficult to reliably evaluate the effects of potential data noise and batch effects on model performance. We therefore provide an interface for integrating new data as they become available. In this work, we showed that path information derived from population averages helps predict the connectivity strength between the start and end brain regions. Although these paths capture the core characteristics of neural projections for connectivity prediction, they cannot completely represent the highly diverse projection trajectories of individual neurons. Better reconstructed paths would further facilitate integrative analyses. It should also be noted that other kinds of connectivity mapping data exist in this field. For example, electron microscopy can provide synaptic-level reconstruction of neuronal connectomes (Witvliet et al. 2021); however, no whole-brain mouse dataset of this kind is currently available. BARseq and MAPseq also provide single-neuron projectomes, but these data contain only a limited number of genes (Kebschull et al. 2016, Chen et al. 2019, Sun et al. 2021, Chen et al. 2022), which places them outside the main purpose of this work, i.e. comprehensively analyzing the impact of gene expression on brain connectivity. The development of high-throughput single-neuron projectomes will facilitate further biological analyses of brain connectivity, such as cell-cell communication.

In summary, as more brain connectivity data and spatial transcriptomics data become available, our software can be further extended to perform data integration and to support the prediction and exploration of connectivity mapping.

Supplementary Material

btag120_Supplementary_Data

Bibliography

1. Bashaw GJ, Klein R. Signaling from axon guidance receptors. Cold Spring Harb Perspect Biol 2010;2:a001941. PMID 20452961. doi:10.1101/cshperspect.a001941. PMC2857166.
2. Blaylock RL, Faria M. New concepts in the development of schizophrenia, autism spectrum disorders, and degenerative brain diseases based on chronic inflammation: a working hypothesis from continued advances in neuroscience research. Surg Neurol Int 2021;12:556. PMID 34877042. doi:10.25259/SNI_1007_2021. PMC8645502.
3. Cannon RC, Turner DA, Pyapali GK et al. An on-line archive of reconstructed hippocampal neurons. J Neurosci Methods 1998;84:49–54. PMID 9821633. doi:10.1016/s0165-0270(98)00091-0.
4. Carey MR, Regehr WG. Noradrenergic control of associative synaptic plasticity by selective modulation of instructive signals. Neuron 2009;62:112–22. PMID 19376071. doi:10.1016/j.neuron.2009.02.022. PMC2837271.
5. Chamberlin NL, Du B, de Lacalle S et al. Recombinant adeno-associated virus vector: use for transgene expression and anterograde tract tracing in the CNS. Brain Res 1998;793:169–75. PMID 9630611. doi:10.1016/s0006-8993(98)00169-3. PMC4961038.
6. Chen J-G, Rašin M-R, Kwan KY et al. Zfp312 is required for subcortical axonal projections and dendritic morphology of deep-layer pyramidal neurons of the cerebral cortex. Proc Natl Acad Sci USA 2005;102:17792–7. PMID 16314561. doi:10.1073/pnas.0509032102. PMC1308928.
7. Chen X, Sun Y-C, Zhan H et al. High-throughput mapping of long-range neuronal projection using in situ sequencing. Cell 2019;179:772–86.e719. PMID 31626774. doi:10.1016/j.cell.2019.09.023. PMC7836778.
8. Chen Y, Chen X, Baserdem B et al. High-throughput sequencing of single neuron projections reveals spatial organization in the olfactory cortex. Cell 2022;185:4117–34.e28. PMID 36306734. doi:10.1016/j.cell.2022.09.038. PMC9681627.