TL;DR
HNCcorr is a new combinatorial algorithm for cell detection in calcium imaging movies that outperforms existing methods by providing optimal solutions through correlation-based grouping.
Contribution
The paper introduces HNCcorr, a novel algorithm based on combinatorial optimization for improved cell identification in calcium imaging data.
Findings
Achieves best results on Neurofinder benchmark.
Guarantees an optimal solution to the cell detection problem.
Provides transparent mapping from data to identified cells.
Abstract
Calcium imaging has emerged as a workhorse method in neuroscience to investigate patterns of neuronal activity. Instrumentation to acquire calcium imaging movies has rapidly progressed and has become standard across labs. Still, algorithms to automatically detect and extract activity signals from calcium imaging movies are highly variable from~lab~to~lab and more advanced algorithms are continuously being developed. Here we present HNCcorr, a novel algorithm for cell identification in calcium imaging movies based on combinatorial optimization. The algorithm identifies cells by finding distinct groups of highly similar pixels in correlation space, where a pixel is represented by the vector of correlations to a set of other pixels. The HNCcorr algorithm achieves the best known results for the cell identification benchmark of Neurofinder, and guarantees an optimal solution to the…
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
Figure 20| Paper | Assumptions on temporal component matrix | Assumptions on background matrix | Objective norm | Regularization | Initialization |
|---|---|---|---|---|---|
| Maruyama et al [11] | None | - Rank 1 matrix | norm | None | Random |
| Pnevmatikakis et al. [12, 13] | AR() process with spikes | - Constant for each pixel | norm | Bayesian prior on sparsity of spikes and nuclear norm penalty on sparsity of | Clustering method |
| Diego-Andilla & Hamprecht [14] | Sparse low-rank matrix | - Rank 1 matrix | norm | Total variation norm on the spatial component of the background | Random |
| Pnevmatikakis et al. [25, 15] | AR() process with spikes | - Rank 1 matrix | norm | Constraints on variance of noise vector | Greedy algorithm |
| Pachitariu et al. [16] | None | Set of isotropic 2d raised cosine functions | norm | None | Not specified |
| Dataset name | Source of dataset | Length [s] | Frequency [hz] | Brain region | Reference technique |
|---|---|---|---|---|---|
| 00.00.test | Svoboda Lab | 438 | 7.00 | vS1 | Anatomical markers |
| 00.01.test | Svoboda Lab | 458 | 7.00 | vS1 | Anatomical markers |
| 01.00.test | Hausser Lab | 300 | 7.50 | v1 | Human labeling |
| 01.01.test | Hausser Lab | 667 | 7.50 | v1 | Human labeling |
| 02.00.test | Svoboda Lab | 1000 | 8.00 | vS1 | Human labeling |
| 02.01.test | Svoboda Lab | 1000 | 8.00 | vS1 | Human labeling |
| 03.00.test | Losonczy Lab | 300 | 7.50 | dHPC CA1 | Human labeling |
| 04.00.test | Harvey Lab | 444 | 6.75 | PPC | Human labeling |
| 04.01.test | Harvey Lab | 1000 | 3.00 | PPC | Human labeling |
| Metric | Algorithm | Datasets | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| 00.00 | 00.01 | 01.00 | 01.01 | 02.00 | 02.01 | 03.00 | 04.00 | 04.01 | ||
| F1 | HNCcorr | 29 | 33 | 53 | 56 | 75 | 68 | 23 | 38 | 68 |
| CNMF | 28 | 35 | 48 | 47 | 62 | 54 | 28 | 39 | 62 | |
| Suite2P | 32 | 39 | 49 | 39 | 60 | 52 | 23 | 47 | 66 | |
| Recall | HNCcorr | 19 | 25 | 39 | 42 | 94 | 89 | 15 | 50 | 58 |
| CNMF | 18 | 24 | 37 | 36 | 75 | 67 | 20 | 45 | 59 | |
| Suite2P | 20 | 27 | 39 | 38 | 83 | 72 | 15 | 55 | 58 | |
| Precision | HNCcorr | 62 | 52 | 80 | 84 | 62 | 55 | 47 | 31 | 83 |
| CNMF | 69 | 65 | 69 | 67 | 53 | 45 | 49 | 35 | 65 | |
| Suite2P | 80 | 72 | 68 | 40 | 47 | 40 | 57 | 41 | 76 | |
| Inclusion | HNCcorr | 63 | 65 | 58 | 56 | 78 | 76 | 55 | 68 | 74 |
| CNMF | 73 | 79 | 56 | 51 | 79 | 80 | 64 | 70 | 77 | |
| Suite2P | 72 | 78 | 56 | 45 | 78 | 77 | 78 | 68 | 86 | |
| Exclusion | HNCcorr | 69 | 65 | 93 | 94 | 84 | 85 | 35 | 90 | 87 |
| CNMF | 66 | 68 | 96 | 97 | 87 | 87 | 43 | 90 | 88 | |
| Suite2P | 65 | 64 | 92 | 95 | 81 | 82 | 38 | 92 | 83 | |
| Metric | Algorithm | Datasets | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| 00.00 | 00.01 | 01.00 | 01.01 | 02.00 | 02.01 | 03.00 | 04.00 | 04.01 | ||
| F1 | HNCcorr + conv2d | 55 | 61 | 53 | 56 | 75 | 68 | 82 | 38 | 68 |
| Sourcery | 45 | 53 | 62 | 45 | 72 | 56 | 84 | 39 | 69 | |
| Suite2P + Donuts | 45 | 53 | 49 | 39 | 60 | 52 | 84 | 47 | 66 | |
| Recall | HNCcorr + conv2d | 43 | 51 | 39 | 42 | 94 | 89 | 76 | 50 | 58 |
| Sourcery | 37 | 44 | 52 | 68 | 85 | 81 | 85 | 57 | 57 | |
| Suite2P + Donuts | 37 | 44 | 39 | 38 | 83 | 72 | 85 | 55 | 58 | |
| Precision | HNCcorr + conv2d | 75 | 76 | 80 | 84 | 62 | 55 | 86 | 31 | 83 |
| Sourcery | 58 | 67 | 77 | 34 | 62 | 43 | 83 | 29 | 86 | |
| Suite2P + Donuts | 58 | 67 | 68 | 40 | 47 | 83 | 57 | 41 | 76 | |
| Inclusion | HNCcorr + conv2d | 55 | 56 | 58 | 56 | 78 | 76 | 75 | 68 | 74 |
| Sourcery | 98 | 98 | 81 | 72 | 96 | 95 | 100 | 92 | 98 | |
| Suite2P + Donuts | 98 | 98 | 56 | 45 | 78 | 77 | 100 | 68 | 86 | |
| Exclusion | HNCcorr + conv2d | 78 | 77 | 93 | 94 | 84 | 79 | 35 | 90 | 87 |
| Sourcery | 36 | 35 | 81 | 84 | 56 | 58 | 31 | 73 | 63 | |
| Suite2P + Donuts | 36 | 35 | 92 | 95 | 81 | 82 | 31 | 92 | 83 | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
HNCcorr: A Novel Combinatorial Approach for Cell Identification in
Calcium-Imaging Movies
Quico Spaen1, Dorit S. Hochbaum1, Roberto Asín-Achá2
- 1
Department of Industrial Engineering & Operations Research, University of California, Berkeley, Etcheverry Hall, CA 94720, USA
- 2
Department of Computer Science, Universidad de Concepción, Concepción, Chile
Abstract
Calcium imaging has emerged as a workhorse method in neuroscience to investigate patterns of neuronal activity. Instrumentation to acquire calcium imaging movies has rapidly progressed and has become standard across labs. Still, algorithms to automatically detect and extract activity signals from calcium imaging movies are highly variable from lab to lab and more advanced algorithms are continuously being developed. Here we present HNCcorr, a novel algorithm for cell identification in calcium imaging movies based on combinatorial optimization. The algorithm identifies cells by finding distinct groups of highly similar pixels in correlation space, where a pixel is represented by the vector of correlations to a set of other pixels. The HNCcorr algorithm achieves the best known results for the cell identification benchmark of Neurofinder, and guarantees an optimal solution to the underlying deterministic optimization model resulting in a transparent mapping from input data to outcome.
Introduction
In the past decade new methods to measure neuronal activity have emerged based on optical imaging of activity-dependent sensors [1, 2]. In particular, optical imaging using genetically-encoded calcium indicators, known as calcium imaging, has become a key tool for circuits and systems neuroscience and is currently used by hundreds of labs across the world [1]. While instrumentation and the calcium indicators have matured rapidly, algorithms to automatically detect and extract activity signals from calcium imaging movies are highly variable from lab to lab. More advanced algorithms are continuously being developed. As a result, the analysis of calcium imaging movies is non-standardized and requires extensive human supervision [1, 3].
Calcium imaging movies collected during an experiment require post-processing to extract the fluorescence signals of the individual cells in the movie. This process consists of two steps. First, in the cell identification problem, the goal is to identify the spatial footprint of the cells in the movie. The spatial footprint of a cell is the set of pixels that contain this cell. Then, in the signal extraction step, the fluorescence signal of each cell is extracted based on the intensities of the pixels in the spatial footprint of the cell. For the signal extraction step a number of effective algorithms have been proposed [4, 5, 6, 7]. For the cell identification problem, in addition to manual identification of the cells, two types of automated procedures have been proposed: Shape-based segmentation algorithms [8, 9], and signal-based algorithms. The signal-based algorithm use one of two techniques: Matrix factorization [10, 11, 12, 13, 14, 15, 16] or graph partitioning [17]. See the literature review for a detailed discussion of these methods.
In this paper we present HNCcorr, a new signal-based method for cell identification in calcium imaging movies. Our method is novel in that it identifies the spatial footprint of a cell by mapping the pixels to correlation space, where a pixel is represented by the vector of correlations to a set of other pixels, and the use of a graph-partitioning model that can be solved optimally. In contrast, all other techniques rely on heuristics that do not guarantee optimal solutions for the underlying models. By mapping the pixels to correlation space, the pixels of each cell group together and also the background pixels congregate. HNCcorr then uses the HNC model [18, 19] to identify cells one at a time by finding a set of pixels that are close in correlation space, but whose correlation patterns are distinct from the rest of the pixels. This type of model originates in image segmentation and is also effective in general data mining [18, 19, 20, 21, 22, 23]. This model is solved optimally with a fast combinatorial optimization algorithm. A Python implementation of HNCcorr is available on GitHub: https://github.com/quic0/HNCcorr.
Experimentally, the HNCcorr outperforms other state-of-the-art approaches in the Neurofinder challenge [24]. It is able to accurately detect more cells than competing algorithms. When combined with the shape-based algorithm Donuts [8] for detecting cells without a signal, HNCcorr achieves the best known average score for the Neurofinder benchmark as of January 2017.
Literature Review
The known approaches for the cell identification problem can be classified into two different categories: Shape-based algorithms and signal-based algorithms. The algorithms in the first category identify the cell bodies by looking for the shape of a cell in an input image, which is typically the image containing the average intensity of each pixel over time. The second category of algorithms identifies cells by identifying clusters of pixels with a similar fluorescence signal over time. The signal-based algorithms can be split into two types of techniques: Matrix factorization and graph partitioning. Next, each of these techniques is discussed in detail.
Shape-based identification:
Pachitariu et al. [8] propose a generative model that aims to capture regularities in an image composed of many elements of few different types. The model assumes that a shape (i.e. a cell body) is built from a set of learned features (e.g. edges, curved edges, corners, etc). Based on the learned features, the model can predict the probability that a group of pixels represents a cell body. The algorithm Donuts uses these learned features to identify a set of shapes that should correspond with cell bodies.
More recently, an algorithm named Conv2d that applies convolutional neural networks in a two-dimensional image has been proposed [9]. The network is trained on data where the cell bodies have been identified. Based on this trained network, it predicts the probability that each pixel belongs to a cell and constructs the spatial footprints of the cells.
Both of these approaches tend to work well whenever the cell body is easily distinguishable in the input image. However, cells that are not clearly visible in this image cannot be found. In practice, due to the acquisition technique, many cells are only visible in the small number of frames where they are activated. These cells do not show in the image containing the average intensity of each pixel and can thus not be identified. Furthermore, these techniques are also not well-suited for detecting overlapping cells, since these cells can only be separated based on their signals.
Matrix factorization:
Matrix factorization is a set of techniques that factorize a matrix as a product of multiple low-dimensional matrices. These techniques provide the spatial footprint and the signals of the cells. Mukamel et al. [10] was the first paper that introduced the idea that the intensity of a pixel over time is a noisy measurement of a mixture of signals. The movie can be split into a set of signals, each having a spatial and a temporal component. The spatial component measures the extent to which a signal contributes to the intensity of each of the pixels and the temporal component represents the value of the signal over time. To extract the cells’ signals from the movie, the method first uses Principal Component Analysis (PCA) to reduce the dimensions of the data and noise. Then, the signals are identified using spatio-temporal independent component analysis (ICA).
In recent work [14, 11, 12], it was recognized that this idea can be formalized as matrix factorization. They consider the following generative model for the signal of the intensities of the pixels:
[TABLE]
where is a non-negative matrix containing the fluorescence of all pixels over time, is a non-negative matrix that measures the contribution of each cell’s signal to the intensity of each pixel, is a matrix that captures the signal of the cells at each time step, and is a background signal. The matrix is known and we wish to infer the values of the matrices , where and respectively capture the spatial and temporal components of each of the cells. The matrices are inferred by minimizing the objective function , where is a matrix norm and is a regularization function on the matrices , , and/or . An iterative procedure known as coordinate descent (a special case is Alternating Least Squares) is applied to this problem. Starting from an initial solution, the iterative algorithm alternates between fixing and optimizing over and fixing and optimizing over until the solution converges. All of the papers listed above present a variant of this approach. Table 1 summarizes their differences.
Although these techniques claim adequate results, the models and their solution techniques suffer from major shortcomings. The optimization problem of matrix factorization is inherently non-convex and computationally intractable, which means it cannot be solved optimally. Instead, the associated algorithms are heuristics that only find a local minimum close to the initial solution. As a result, the quality of the solution is highly dependent on the initial solution and may not generalize across datasets.
Furthermore, these algorithms are based on a set of assumptions on the dynamics of the fluorescent calcium indicator. These assumptions do not necessarily capture the actual dynamics. In contrast, the HNCcorr algorithm makes no specific assumptions about the underlying dynamics of the fluorescent indicator.
Graph partitioning:
This group of techniques identifies cells by partitioning a graph in which the nodes correspond to the pixels and the edges measures the similarity between pairs of pixels. In [17], Kaifosh et al. propose a model in which they find the ROIs, consisting of one or more cells, by repeatedly subdividing the set of pixels into coherent groups. To do this, they use an eigenvector-based heuristic for an optimization problem known as Normalized Cut (NC) [26]. This approach suffers from two major drawbacks. First, the Normalized Cut model is not appropriate for this problem (see the section on methodology for details). Second, they rely on a heuristic for solving the optimization problem, which may lead to suboptimal results.
The HNCcorr algorithm is similar to the algorithm of [17] in that it partitions the pixels based on a similarity graph. However, the algorithm uses both a different graph and a different optimization model. As a result, it is not affected by the shortcomings listed above.
In contrast to the existing approaches, the method proposed here solves a discrete optimization problem and guarantees an optimal solution. The algorithm is robust in that it does not rely on initialization and requires minimal parameter tuning. Furthermore, our algorithm uses the novel idea to identify cell bodies based on the similarity between pixels in correlation space.
Methodology
HNCcorr aggregates a set of pixels in a cluster so they are highly similar to each other, and highly non-similar to the pixels not in the cluster. These clusters form the spatial footprints of cells. HNCcorr identifies these clusters one at a time. It is noted that the intensity of a pixel in the spatial footprint of a cell is a noisy observation of the signal of the cell over time. This implies that pixels in the spatial footprint of a cell should have the same intensity pattern over time, and that the intensities of these pixels should be highly correlated. To that end, we associate with each pixel the vector of signal correlations with other pixels in the window, referred to as the correlation space. The distances between those vectors are used in HNCcorr as the similarity measure for the clustering problem. Next, we first describe the core components of the HNCcorr algorithm in detail.
Clustering pixels with HNC
The pixels in the spatial footprint of a cell have highly similar signals. We use the HNC model [18, 19] to identify the clusters of pixels in a graph. This graph is an undirected graph where the node set represents the pixels and the edges in set represent similarity relations between pixels. With each edge , we associate a similarity weight where is close to if pixels are highly similar and close to [math] if they are not.
The goal in the HNC model is to find a partition of the nodes (pixels) into the set , which represents the spatial footprint of a single cell, and the set , which are the remaining pixels, so that the similarity between pixels in and the pixels in is as low as possible, low inter-similarity, and the pixels in the spatial footprint are highly similar to each other, high intra-similarity. This is expresses as the following optimization problem:
[TABLE]
The parameter determines the tradeoff between the inter-similarity and the intra-similarity. Note that minimizing the negative of the intra-similarity term is equivalent to maximizing it.
HNC is closely related to a well-known optimization problem named Normalized Cut (NC) from the field of image segmentation [26, 27]. NC has been previously used for calcium imaging movies to find regions that contain one or more cells by Kaifosh et al. [17]. NC is not an appropriate model for the cell identification problem because it would try to assure that the remaining pixels (i.e. those in ) are also highly similar. Furthermore, NC is known to be NP-hard which means that it is computationally intractable. In practice, it is solved with a heuristic based on eigenvectors [26].
In contrast, very efficient algorithms exist to optimally solve the HNC problem. These algorithms [18, 19] arise from reducing the HNC problem to the well-known minimum cut/maximum flow problem. These algorithms are especially powerful in that they guarantee to find an optimal set . This is a major advantage since it creates a unique and transparent mapping from input to output. As a result, it is straightforward to understand how changing the input affects the output of the algorithm. Furthermore, we can use parametric cut/flow algorithms for the minimum cut problem [28, 29] that find the optimal solutions for all values of at once, removing the need for tuning the parameter.
Seed and window selection
At each iteration the goal is to segment a single cell not previously identified. To ensure that the model identifies the correct cell, HNC needs as input a small set of representative pixels, seeds, for and . For we select as seed a superpixel, a small square of pixels where . Note that when a superpixel is a singleton pixel. We refer to the selected superpixel for as the seed.
For we select a small set of pixels, referred to as the negative seeds. These pixels are picked uniformly from a circle of a sufficiently large radius centered at the seed, such that the cell to be identified, if exists, can be assumed to be contained in the circle.
For each cell to be segmented, only the pixels within a certain vicinity of the seed need to be considered. For that purpose a sufficiently large window is constructed around the seed. The window is an rectangular subregion of the movie that contains the seed at its center. The HNC clustering problem is restricted to the graph defined on the pixels in this window.
To find all cells, one can simply try all possible (super)pixels of a given size as seeds. Since the procedure for identifying a single cell is computationally very efficient, this exhaustive search is possible. Nevertheless, this procedure performs a significant amount of unnecessary computation. For example, selecting two seeds close to each other is likely to give the same outcome. Hence, it is sufficient to consider at most a few pixels per small subregion of the movie. Furthermore, a good indication of whether the seed is part of a cell is to compute the average correlation between the seed and the surrounding neighborhood of e.g. pixels. In most cases, it would be sufficient to try only the percent of pixels with the highest average correlation as seeds to detect nearly all of the active cells. Finally, pixels that are part of a previously found cell should be excluded to prevent the same cell from being segmented multiple times.
Measuring similarity: Correlation space
A standard technique for measuring similarity between signals is to measure the correlation between them. However, this is not an appropriate measure for the similarity measure . The problem with correlation is that pixels that are not part of any cell are not sufficiently correlated with each other. As a result, these pixels are as likely to be clustered in (the spatial footprint) as in . To overcome this, we introduce correlation space.
In correlation space, a pixel is represented by a vector containing the correlations between the pixel’s intensity timeseries and the timeseries of all pixels in the window. That is, every pixel is described by a vector of correlations , where is the number of pixels in the window. The pixels that are part of a spatial footprint will have high values for the entries that correspond to pixels that are in the spatial footprint and low values for the entries of pixels that are not in the spatial footprint. As a result, the pixels in each spatial footprint as well as the background pixels cluster in correlation space. See Figures 1 and 2 for visualizations of this phenomenon.
To define we measure the Gaussian similarity between pixels and in correlation space. Precisely, we define as:
[TABLE]
where is a scaling parameter that is typically set to . Note that is close to one when the distance between pixels and in correlation space is small and near zero when the distance is large.
Sparse Computation
For the edge set of the graph on which we solve the HNC model, one option is to naively select to consist of all possible pairs of pixels. However, this can be computationally costly and unnecessary since most pairs of pixels are highly dissimilar. Instead, we use a methodology called sparse computation [22, 23]. Sparse computation allows us to compute only a small subset of all possible pairs. It uses the observation that for many pairs the pairwise similarity is close to zero and that removing these edges does typically not affect the outcome. For general machine learning, sparse computation significantly reduces the running time of similarity-based classifiers [22, 23], such as SNC [30], KNN, and kernel SVM. In sparse computation, the data, here consisting of the correlation vectors, is efficiently projected onto a low-dimensional space of dimension , typically using a fast approximation of principal component analysis. The low-dimensional space is then subdivided into sections per dimension resulting in grid blocks. We add to the edge set edges between pixels in the same or neighboring grid blocks.
HNCcorr algorithm
Having introduced the key ideas behind HNCcorr, we now describe the HNCcorr algorithm.
The HNCcorr algorithm iterates through a list of seeds. For each seed, the algorithm performs the steps described below to identify a cell whose spatial footprint contains the seed or to conclude that no such cell exists. The algorithm first constructs a window around the seed and picks the negative seeds. Next, the algorithm maps each pixel to correlation space. It then determines which edges are in the graph with the sparse computation technique. For each of these edges, is computed. Now that the graph is well-defined, the algorithm calls a subroutine to solve the corresponding HNC problem for all . We can obtain an optimal set for each value of , since the sets are nested (, ) and therefore there can be at most such sets.
Finally, a simple oracle decides the best spatial footprint for the cell based on these sets or concludes that there is no cell located at the seed. Currently, the oracle simply checks if the number of segmented pixels is between a given lower and upper bound based on the expected size of a cell. In case none of the segmentations satisfy this criteria, then the oracle concludes that no cell containing the input seed exists. In case more than one segmentation satisfies the size criteria, the oracle returns the spatial footprint where the number of segmented pixels is closest to a given expected cell size. Note that more complex oracles can be used as well.
The output of the algorithm is either the best segmentation or an empty set indicating that no cell has been identified. This process repeats for each of the selected seeds.
A visualization of a segmentation for a given seed is given in Figure 3. Figure 4 shows an example of the output of the HNCcorr algorithm after processing all seeds in a calcium imaging movie.
Experimental Results
To evaluate the experimental performance of HNCcorr in detecting cells in calcium imaging movies, we compare against the best known algorithms to date. Specifically, we test HNCcorr as well as the matrix factorization algorithms Suite2P [16] and CNMF [15] on the datasets of the Neurofinder public benchmark [24]. The Neurofinder benchmark is an initiative of the CodeNeuro collective of neuroscientists that encourages software tool development for neuroscience research. Based on the Neurofinder benchmark, we show that HNCcorr outperforms the other algorithms with a relative improvement in average F1-score across datasets of at least 14 percent.
The Neurofinder benchmark consists of nine testing datasets and nineteen training datasets. Each dataset consists of a calcium imaging movie. For the training datasets, a reference list of the spatial footprints of the cells in the movie, is provided with the dataset. For the testing datasets, the reference list (i.e. the spatial footprints of the cells) is not disclosed. The datasets have been contributed by various labs and have been recorded under different experimental conditions. They differ in sample frequency, length of the movie, magnification, signal-to-noise ratio, region of the brain that was recorded, and how the footprints in the datasets were determined. Table 2 contains the characteristics of the testing datasets.
The datasets can be split into two groups based on the activity of the cells. The reference list for the 00 and 03 dataset series contains many cells that have a weak or non-existent signal, whereas most cells have a detectable signal in the 01, 02, and 04 dataset series.As a result, algorithms based on signal detection are expected to perform poorly on the 00 and 03 dataset series.
To evaluate an algorithm on the Neurofinder benchmark, the algorithm should identify the cells in each testing dataset. The spatial footprints of the cells as detected are then submitted to Neurofinder. The submission is then automatically evaluated by comparing it against the reference list of spatial footprints provided by the dataset contributors. Since these lists are not publicly disclosed for the testing datasets, this guarantees an unbiased and fair evaluation. Furthermore, this procedure removes the need for replicating the results of other algorithms.
A submission to Neurofinder is evaluated for each testing dataset based on metrics for detection quality and metrics for the segmentation quality111See the supplementary material for results on detection quality.. To measure the detection quality of the algorithm we consider the following metrics, [24]:
Recall
The percentage of the cells in the undisclosed reference list of spatial footprints that are recovered by the algorithm.
Precision
The percentage of the cells identified by the algorithm that are also present in the undisclosed reference list.
F1-score
The harmonic mean of precision and recall. This is a standard metric in machine learning for evaluating the overall performance of an algorithm.
For the recall and precision metrics, a cell is considered identified if the center of mass of the spatial footprint as determined by the algorithm is within five pixels of the center of mass of the spatial footprint in the reference list.
The experimental performance of the signal-based algorithms HNCcorr, CNMF, and Suite2P on the 01, 02, 04 dataset series is shown in Figure 5. The implementation details for the HNCcorr submission are described in the supplementary material. Since the 00 and 03 dataset series are not appropriate for signal-based algorithms, these dataset series have been excluded. Results on all testing datasets are provided in the supplementary material.
Overall, the HNCcorr algorithm has superior performance across the datasets. The HNCcorr achieves a 14 percent relative improvement in average F1-score compared to the Suite2P and CNMF algorithm. The HNCcorr algorithm also achieves the highest F1-score for each of the datasets with the exception of the dataset 04.00. This improvement is mainly due to an increase in precision across all datasets except 04.00. All of the algorithms have a significantly lower precision for the 04.00 dataset as compared to the other datasets. Based on analysis of the 04.00 training dataset, a possible explanation is that some cells with signal are identified by the algorithms but these cells are not in the reference listing. Furthermore, the HNCcorr algorithm achieves a significant improvement in recall for the 02.00 and 02.01 datasets, resulting in near-perfect recall.
A common strategy for the Neurofinder benchmark is to submit an ensemble of algorithms to mitigate the poor performance of the signal-based algorithms on the 00 and 03 dataset series. Typically, the ensemble consists of a shape-based detection for the 00 and 03 series and a signal-based algorithm for the 01, 02, and 04 dataset series. In Figure 6, we compare the results of HNCcorr combined with conv2d with two other ensembles of algorithms based on Suite2P. These three submissions are leading on the Neurofinder leaderboard as of January 2017, and the ensemble of the HNCcorr algorithm and the conv2d shape-based algorithm achieves the highest score for the Neurofinder benchmark.
Discussion
The algorithm HNCcorr is a new algorithm for cell identification in calcium imaging movies. In contrast to previous methods, HNCcorr is not based on matrix factorization. Instead, the HNCcorr algorithm combines graph clustering based on combinatorial optimization with the use of correlations. Here we discuss several characteristics of HNCcorr.
First, the distinguishing feature of HNCcorr is that its underlying optimization model, the HNC model, can be solved optimally with a combinatorial optimization algorithm. To the best of our knowledge, all other techniques rely on heuristics for finding a solution to their model. Solving a model optimally is important because the solution is fully determined by the model and does not depend on the solution technique. This provides a transparent mapping from the model and the input data to the outcome. As a result, unsatisfactory results can therefore be analyzed solely based on the model, without having to consider any uncertainty introduced by the solution technique.
Second, the HNCcorr algorithm has a different computational structure. The key difference compared to other algorithms is that the algorithm detects the cell bodies of the neurons one at a time. An immediate advantage is that multiple seeds can be evaluated simultaneously. This allows for efficient parallelization of the algorithm. Another characteristic of HNCcorr is that it has an oracle that filters segmentations that are not cells. This is used since HNCcorr attempts to identify a cell based on any seed. As demonstrated, a simple oracle based on the size of the segmentation is sufficient to attain state-of-the-art precision. We believe that a more advanced oracle could significantly improve the precision of the algorithm, and we plan to investigate this further. Another observation is that the HNCcorr algorithm scales linearly in the length of the movie whereas matrix factorization algorithms do not.
A critical parameter for the HNCcorr algorithm is the size of the window. The window size sets the dimension of the correlation space and determines the computational cost of mapping the pixels to correlation space. If this becomes a significant computational issue, then we plan to explore the following ideas: One is to limit the computation of correlation only to the relevant similarities and set the remaining correlations to zero. For example, one could use an unbiased estimator for the correlation based on a random subset of the frames and compute only those correlation that are significantly different from zero. Another option is to limit the dimension of the correlation space by computing only the correlations with respect to a (possibly random) subset of the pixels.
An often stated advantage of matrix factorization based algorithms is that these algorithms can identify overlapping cells based on their signal. Although the percentage of cells that overlap is typically small, the HNCcorr algorithm is also able to identify overlapping cells, see Figure 7 for an example. The key observation here is that pixels can be segmented more than once and, thus, belong to multiple cells.
Finally, a novel feature of HNCcorr is the use of correlation space. Through the mapping to correlation space, HNCcorr is the first algorithm that employs correlation to attain state-of-the-art results. As previously discussed in Figures 1 and 2, the success of employing correlation space is that closeness in this space is a strong indicator of whether pixels correspond to the same cells or background.
Conclusion
In this paper we present HNCcorr for identifying cells in calcium imaging movies. HNCcorr segments cells one at a time by finding distinct sets of pixels that are nearly identical in correlation space. The algorithm is based on a combinatorial optimization problem known as HNC that can be solved efficiently. As a result, the algorithm guarantees to find an optimal solution to the underlying optimization problem. Experimentally, HNCcorr has superior performance compared to matrix factorization based algorithms that have been considered state-of-the-art. For cells segmented based on activity, HNCcorr achieves a relative improvement of at least 14 percent in average F1-score compared to CNMF and Suite2P. Combined with the Conv2d shape-based detection algorithm, HNCcorr achieves the best known results to date in the Neurofinder benchmark.
For further research, we plan to adapt HNCcorr to calcium imaging movies collected with light-field microscopy [31] as well as for the detection of subcellular components (e.g. dendrites). We also plan to explore a more powerful oracle that decides whether a segmentation corresponds to a cell which could potentially lead to a significant improvement in the performance of the algorithm. Finally, we plan to investigate other domains where HNCcorr could be of value.
Acknowledgments
The authors would like to thank Christopher Harvey, Selmaan Chettih, and Matthias Minderer from Harvard Medical School for their guidance, the data they provided, and for the valuable discussions we have had.
Supplementary Material
Implementation details
HNCcorr was implemented in Matlab R2015a. Here we describe the relevant implementation details.
A new Python version of the code has been made available on GitHub: https://github.com/quic0/HNCcorr
Preprocessing:
The datasets were preprocessed by averaging every ten frames into a single frame to reduce the noise. Each preprocessed dataset is stored in HDF5 format for efficient data access.
Seed selection:
To identify possibly effective seeds, we use the procedure described in Algorithm 1.
Parameters settings:
For sparse computation we use three dimensions () for the low-dimensional space and a grid resolution of . For the negative seeds, we select a set of nine pixels uniformly distributed from a large circle.
The remaining parameters depend on the dataset and are defined as follows:
[TABLE]
Neurofinder Benchmark results
The tables below show the performance of the algorithms on all datasets for both the cell detection and segmentation quality metrics. The metrics for cell detection have been defined in the experimental results section. To measure the segmentation quality, the Neurofinder benchmark uses the metrics inclusion and exclusion [24].
Metrics segmentation quality
Let be the set of pixels in the spatial footprint of a cell for the reference labeling and be the set of pixels of the spatial footprint of a cell for the algorithm. Then, the metrics inclusion and exclusion are defined as follows:
Inclusion
Average of across all cells correctly identified by the algorithm.
Exclusion
Average of across all cells correctly identified by the algorithm.
These metrics are of lower importance since the critical task is finding the location of a cell. It may still be possible to extract the signal of the cell even if the shape is not detected perfectly. Also, the spatial footprint of the cell in the reference labeling is not always fine-tuned.
Results
For each of the listed metrics, higher scores are better.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Hamel, E. J., Grewe, B. F., Parker, J. G. & Schnitzer, M. J. Cellular level brain imaging in behaving mammals: an engineering approach. Neuron 86 , 140–159 (2015).
- 2[2] Scanziani, M. & Häusser, M. Electrophysiology in the age of light. Nature 461 , 930–939 (2009).
- 3[3] Peron, S., Chen, T.-W. & Svoboda, K. Comprehensive imaging of cortical networks. Current opinion in neurobiology 32 , 115–123 (2015).
- 4[4] Vogelstein, J. T. et al. Fast nonnegative deconvolution for spike train inference from population calcium imaging. Journal of neurophysiology 104 , 3691–3704 (2010).
- 5[5] Grewe, B. F., Langer, D., Kasper, H., Kampa, B. M. & Helmchen, F. High-speed in vivo calcium imaging reveals neuronal network activity with near-millisecond precision. Nature methods 7 , 399–405 (2010).
- 6[6] Theis, L. et al. Supervised learning sets benchmark for robust spike detection from calcium imaging signals. Ar Xiv e-prints (2015). 1503.00135 .
- 7[7] Pnevmatikakis, E. A., Merel, J., Pakman, A. & Paninski, L. Bayesian spike inference from calcium imaging data. In 2013 Asilomar Conference on Signals, Systems and Computers , 349–353 (IEEE, 2013).
- 8[8] Pachitariu, M. et al. Extracting regions of interest from biological images with convolutional sparse block coding. In Advances in Neural Information Processing Systems , 1745–1753 (2013).
