HNCcorr: A Novel Combinatorial Approach for Cell Identification in   Calcium-Imaging Movies

Quico Spaen; Dorit S. Hochbaum; Roberto As\'in-Ach\'a

arXiv:1703.01999·q-bio.QM·June 3, 2019

HNCcorr: A Novel Combinatorial Approach for Cell Identification in Calcium-Imaging Movies

Quico Spaen, Dorit S. Hochbaum, Roberto As\'in-Ach\'a

PDF

2 Repos

TL;DR

HNCcorr is a new combinatorial algorithm for cell detection in calcium imaging movies that outperforms existing methods by providing optimal solutions through correlation-based grouping.

Contribution

The paper introduces HNCcorr, a novel algorithm based on combinatorial optimization for improved cell identification in calcium imaging data.

Findings

01

Achieves best results on Neurofinder benchmark.

02

Guarantees an optimal solution to the cell detection problem.

03

Provides transparent mapping from data to identified cells.

Abstract

Calcium imaging has emerged as a workhorse method in neuroscience to investigate patterns of neuronal activity. Instrumentation to acquire calcium imaging movies has rapidly progressed and has become standard across labs. Still, algorithms to automatically detect and extract activity signals from calcium imaging movies are highly variable from~lab~to~lab and more advanced algorithms are continuously being developed. Here we present HNCcorr, a novel algorithm for cell identification in calcium imaging movies based on combinatorial optimization. The algorithm identifies cells by finding distinct groups of highly similar pixels in correlation space, where a pixel is represented by the vector of correlations to a set of other pixels. The HNCcorr algorithm achieves the best known results for the cell identification benchmark of Neurofinder, and guarantees an optimal solution to the…

Figures20

Click any figure to enlarge with its caption.

Tables4

Table 1. Table 1 : Overview of the specific properties used in each matrix factorization algorithm. See corresponding papers for more detail.

Paper	Assumptions on temporal component matrix $C$	Assumptions on background matrix $B$	Objective norm	Regularization	Initialization
Maruyama et al [11]	None	$b \times f$ - Rank 1 matrix	$ℓ_{2}$ norm	None	Random
Pnevmatikakis et al. [12, 13]	AR( $p$ ) process with spikes	$b 1^{T}$ - Constant for each pixel	$ℓ_{2}$ norm	Bayesian prior on sparsity of spikes and nuclear norm penalty on sparsity of $A$	Clustering method
Diego-Andilla & Hamprecht [14]	Sparse low-rank matrix	$b \times f$ - Rank 1 matrix	$ℓ_{2}$ norm	Total variation norm on the spatial component of the background $b$	Random
Pnevmatikakis et al. [25, 15]	AR( $p$ ) process with spikes	$b \times f$ - Rank 1 matrix	$ℓ_{1}$ norm	Constraints on variance of noise vector	Greedy algorithm
Pachitariu et al. [16]	None	Set of isotropic 2d raised cosine functions	$ℓ_{2}$ norm	None	Not specified

Table 2. Table 2: Characteristics of the test datasets of the Neurofinder benchmark.

Dataset name	Source of dataset	Length [s]	Frequency [hz]	Brain region	Reference technique
00.00.test	Svoboda Lab	438	7.00	vS1	Anatomical markers
00.01.test	Svoboda Lab	458	7.00	vS1	Anatomical markers
01.00.test	Hausser Lab	300	7.50	v1	Human labeling
01.01.test	Hausser Lab	667	7.50	v1	Human labeling
02.00.test	Svoboda Lab	1000	8.00	vS1	Human labeling
02.01.test	Svoboda Lab	1000	8.00	vS1	Human labeling
03.00.test	Losonczy Lab	300	7.50	dHPC CA1	Human labeling
04.00.test	Harvey Lab	444	6.75	PPC	Human labeling
04.01.test	Harvey Lab	1000	3.00	PPC	Human labeling

Table 3. Table 4 : Experimental performance of the HNCcorr, CNMF, and Suite2P algorithms on all Neurofinder testing datasets. The data is taken from Neurofinder submissions Suite2P by marius10p , CNMF_PYTHON by CNMF , and submission HNCcorr by HNC .

Metric	Algorithm	Datasets
Metric	Algorithm	00.00	00.01	01.00	01.01	02.00	02.01	03.00	04.00	04.01
F1	HNCcorr	29	33	53	56	75	68	23	38	68
	CNMF	28	35	48	47	62	54	28	39	62
	Suite2P	32	39	49	39	60	52	23	47	66
Recall	HNCcorr	19	25	39	42	94	89	15	50	58
	CNMF	18	24	37	36	75	67	20	45	59
	Suite2P	20	27	39	38	83	72	15	55	58
Precision	HNCcorr	62	52	80	84	62	55	47	31	83
	CNMF	69	65	69	67	53	45	49	35	65
	Suite2P	80	72	68	40	47	40	57	41	76
Inclusion	HNCcorr	63	65	58	56	78	76	55	68	74
	CNMF	73	79	56	51	79	80	64	70	77
	Suite2P	72	78	56	45	78	77	78	68	86
Exclusion	HNCcorr	69	65	93	94	84	85	35	90	87
	CNMF	66	68	96	97	87	87	43	90	88
	Suite2P	65	64	92	95	81	82	38	92	83

Table 4. Table 5 : Experimental performance on all testing datasets of the three leading submissions of the Neurofinder benchmark as of January 2017. Both the Suite2P and the Sourcery submission use the Donuts algorithm [ 8 ] for the 00 and 03 dataset series and a version of the Suite2P algorithm [ 16 ] .

Metric	Algorithm	Datasets
Metric	Algorithm	00.00	00.01	01.00	01.01	02.00	02.01	03.00	04.00	04.01
F1	HNCcorr + conv2d	55	61	53	56	75	68	82	38	68
	Sourcery	45	53	62	45	72	56	84	39	69
	Suite2P + Donuts	45	53	49	39	60	52	84	47	66
Recall	HNCcorr + conv2d	43	51	39	42	94	89	76	50	58
	Sourcery	37	44	52	68	85	81	85	57	57
	Suite2P + Donuts	37	44	39	38	83	72	85	55	58
Precision	HNCcorr + conv2d	75	76	80	84	62	55	86	31	83
	Sourcery	58	67	77	34	62	43	83	29	86
	Suite2P + Donuts	58	67	68	40	47	83	57	41	76
Inclusion	HNCcorr + conv2d	55	56	58	56	78	76	75	68	74
	Sourcery	98	98	81	72	96	95	100	92	98
	Suite2P + Donuts	98	98	56	45	78	77	100	68	86
Exclusion	HNCcorr + conv2d	78	77	93	94	84	79	35	90	87
	Sourcery	36	35	81	84	56	58	31	73	63
	Suite2P + Donuts	36	35	92	95	81	82	31	92	83

Equations6

F = A C + B + i.i.d. noise

F = A C + B + i.i.d. noise

\emptyset \subset S \subset V min [i, j] \in E, i \in S, j \in \overset{ˉ}{S} \sum w_{ij} Inter-similarity - λ [i, j] \in E, i \in S, j \in S \sum w_{ij} Intra-similarity .

\emptyset \subset S \subset V min [i, j] \in E, i \in S, j \in \overset{ˉ}{S} \sum w_{ij} Inter-similarity - λ [i, j] \in E, i \in S, j \in S \sum w_{ij} Intra-similarity .

w_{ij} := exp (- \frac{∥ R _{i} - R _{j} ∥ _{2}^{2}}{σ ^{2}})

w_{ij} := exp (- \frac{∥ R _{i} - R _{j} ∥ _{2}^{2}}{σ ^{2}})

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

HNCcorr: A Novel Combinatorial Approach for Cell Identification in

Calcium-Imaging Movies

Quico Spaen1, Dorit S. Hochbaum1, Roberto Asín-Achá2

1

Department of Industrial Engineering & Operations Research, University of California, Berkeley, Etcheverry Hall, CA 94720, USA

2

Department of Computer Science, Universidad de Concepción, Concepción, Chile

Abstract

Calcium imaging has emerged as a workhorse method in neuroscience to investigate patterns of neuronal activity. Instrumentation to acquire calcium imaging movies has rapidly progressed and has become standard across labs. Still, algorithms to automatically detect and extract activity signals from calcium imaging movies are highly variable from lab to lab and more advanced algorithms are continuously being developed. Here we present HNCcorr, a novel algorithm for cell identification in calcium imaging movies based on combinatorial optimization. The algorithm identifies cells by finding distinct groups of highly similar pixels in correlation space, where a pixel is represented by the vector of correlations to a set of other pixels. The HNCcorr algorithm achieves the best known results for the cell identification benchmark of Neurofinder, and guarantees an optimal solution to the underlying deterministic optimization model resulting in a transparent mapping from input data to outcome.

Introduction

In the past decade new methods to measure neuronal activity have emerged based on optical imaging of activity-dependent sensors [1, 2]. In particular, optical imaging using genetically-encoded calcium indicators, known as calcium imaging, has become a key tool for circuits and systems neuroscience and is currently used by hundreds of labs across the world [1]. While instrumentation and the calcium indicators have matured rapidly, algorithms to automatically detect and extract activity signals from calcium imaging movies are highly variable from lab to lab. More advanced algorithms are continuously being developed. As a result, the analysis of calcium imaging movies is non-standardized and requires extensive human supervision [1, 3].

Calcium imaging movies collected during an experiment require post-processing to extract the fluorescence signals of the individual cells in the movie. This process consists of two steps. First, in the cell identification problem, the goal is to identify the spatial footprint of the cells in the movie. The spatial footprint of a cell is the set of pixels that contain this cell. Then, in the signal extraction step, the fluorescence signal of each cell is extracted based on the intensities of the pixels in the spatial footprint of the cell. For the signal extraction step a number of effective algorithms have been proposed [4, 5, 6, 7]. For the cell identification problem, in addition to manual identification of the cells, two types of automated procedures have been proposed: Shape-based segmentation algorithms [8, 9], and signal-based algorithms. The signal-based algorithm use one of two techniques: Matrix factorization [10, 11, 12, 13, 14, 15, 16] or graph partitioning [17]. See the literature review for a detailed discussion of these methods.

In this paper we present HNCcorr, a new signal-based method for cell identification in calcium imaging movies. Our method is novel in that it identifies the spatial footprint of a cell by mapping the pixels to correlation space, where a pixel is represented by the vector of correlations to a set of other pixels, and the use of a graph-partitioning model that can be solved optimally. In contrast, all other techniques rely on heuristics that do not guarantee optimal solutions for the underlying models. By mapping the pixels to correlation space, the pixels of each cell group together and also the background pixels congregate. HNCcorr then uses the HNC model [18, 19] to identify cells one at a time by finding a set of pixels that are close in correlation space, but whose correlation patterns are distinct from the rest of the pixels. This type of model originates in image segmentation and is also effective in general data mining [18, 19, 20, 21, 22, 23]. This model is solved optimally with a fast combinatorial optimization algorithm. A Python implementation of HNCcorr is available on GitHub: https://github.com/quic0/HNCcorr.

Experimentally, the HNCcorr outperforms other state-of-the-art approaches in the Neurofinder challenge [24]. It is able to accurately detect more cells than competing algorithms. When combined with the shape-based algorithm Donuts [8] for detecting cells without a signal, HNCcorr achieves the best known average score for the Neurofinder benchmark as of January 2017.

Literature Review

The known approaches for the cell identification problem can be classified into two different categories: Shape-based algorithms and signal-based algorithms. The algorithms in the first category identify the cell bodies by looking for the shape of a cell in an input image, which is typically the image containing the average intensity of each pixel over time. The second category of algorithms identifies cells by identifying clusters of pixels with a similar fluorescence signal over time. The signal-based algorithms can be split into two types of techniques: Matrix factorization and graph partitioning. Next, each of these techniques is discussed in detail.

Shape-based identification:

Pachitariu et al. [8] propose a generative model that aims to capture regularities in an image composed of many elements of few different types. The model assumes that a shape (i.e. a cell body) is built from a set of learned features (e.g. edges, curved edges, corners, etc). Based on the learned features, the model can predict the probability that a group of pixels represents a cell body. The algorithm Donuts uses these learned features to identify a set of shapes that should correspond with cell bodies.

More recently, an algorithm named Conv2d that applies convolutional neural networks in a two-dimensional image has been proposed [9]. The network is trained on data where the cell bodies have been identified. Based on this trained network, it predicts the probability that each pixel belongs to a cell and constructs the spatial footprints of the cells.

Both of these approaches tend to work well whenever the cell body is easily distinguishable in the input image. However, cells that are not clearly visible in this image cannot be found. In practice, due to the acquisition technique, many cells are only visible in the small number of frames where they are activated. These cells do not show in the image containing the average intensity of each pixel and can thus not be identified. Furthermore, these techniques are also not well-suited for detecting overlapping cells, since these cells can only be separated based on their signals.

Matrix factorization:

Matrix factorization is a set of techniques that factorize a matrix as a product of multiple low-dimensional matrices. These techniques provide the spatial footprint and the signals of the cells. Mukamel et al. [10] was the first paper that introduced the idea that the intensity of a pixel over time is a noisy measurement of a mixture of signals. The movie can be split into a set of signals, each having a spatial and a temporal component. The spatial component measures the extent to which a signal contributes to the intensity of each of the pixels and the temporal component represents the value of the signal over time. To extract the cells’ signals from the movie, the method first uses Principal Component Analysis (PCA) to reduce the dimensions of the data and noise. Then, the signals are identified using spatio-temporal independent component analysis (ICA).

In recent work [14, 11, 12], it was recognized that this idea can be formalized as matrix factorization. They consider the following generative model for the signal of the intensities of the pixels:

[TABLE]

where $F$ is a non-negative matrix containing the fluorescence of all pixels over time, $A$ is a non-negative matrix that measures the contribution of each cell’s signal to the intensity of each pixel, $C$ is a matrix that captures the signal of the cells at each time step, and $B$ is a background signal. The matrix $F$ is known and we wish to infer the values of the matrices $A,C,B$ , where $A$ and $C$ respectively capture the spatial and temporal components of each of the cells. The matrices are inferred by minimizing the objective function $\|F-AC-B\|^{2}+\Omega(A,B,C)$ , where $\|\cdot\|$ is a matrix norm and $\Omega(A,B,C)$ is a regularization function on the matrices $A$ , $B$ , and/or $C$ . An iterative procedure known as coordinate descent (a special case is Alternating Least Squares) is applied to this problem. Starting from an initial solution, the iterative algorithm alternates between fixing $A$ and optimizing over $C$ and fixing $C$ and optimizing over $A$ until the solution converges. All of the papers listed above present a variant of this approach. Table 1 summarizes their differences.

Although these techniques claim adequate results, the models and their solution techniques suffer from major shortcomings. The optimization problem of matrix factorization is inherently non-convex and computationally intractable, which means it cannot be solved optimally. Instead, the associated algorithms are heuristics that only find a local minimum close to the initial solution. As a result, the quality of the solution is highly dependent on the initial solution and may not generalize across datasets.

Furthermore, these algorithms are based on a set of assumptions on the dynamics of the fluorescent calcium indicator. These assumptions do not necessarily capture the actual dynamics. In contrast, the HNCcorr algorithm makes no specific assumptions about the underlying dynamics of the fluorescent indicator.

Graph partitioning:

This group of techniques identifies cells by partitioning a graph in which the nodes correspond to the pixels and the edges measures the similarity between pairs of pixels. In [17], Kaifosh et al. propose a model in which they find the ROIs, consisting of one or more cells, by repeatedly subdividing the set of pixels into coherent groups. To do this, they use an eigenvector-based heuristic for an optimization problem known as Normalized Cut (NC) [26]. This approach suffers from two major drawbacks. First, the Normalized Cut model is not appropriate for this problem (see the section on methodology for details). Second, they rely on a heuristic for solving the optimization problem, which may lead to suboptimal results.

The HNCcorr algorithm is similar to the algorithm of [17] in that it partitions the pixels based on a similarity graph. However, the algorithm uses both a different graph and a different optimization model. As a result, it is not affected by the shortcomings listed above.

In contrast to the existing approaches, the method proposed here solves a discrete optimization problem and guarantees an optimal solution. The algorithm is robust in that it does not rely on initialization and requires minimal parameter tuning. Furthermore, our algorithm uses the novel idea to identify cell bodies based on the similarity between pixels in correlation space.

Methodology

HNCcorr aggregates a set of pixels in a cluster so they are highly similar to each other, and highly non-similar to the pixels not in the cluster. These clusters form the spatial footprints of cells. HNCcorr identifies these clusters one at a time. It is noted that the intensity of a pixel in the spatial footprint of a cell is a noisy observation of the signal of the cell over time. This implies that pixels in the spatial footprint of a cell should have the same intensity pattern over time, and that the intensities of these pixels should be highly correlated. To that end, we associate with each pixel the vector of signal correlations with other pixels in the window, referred to as the correlation space. The distances between those vectors are used in HNCcorr as the similarity measure for the clustering problem. Next, we first describe the core components of the HNCcorr algorithm in detail.

Clustering pixels with HNC

The pixels in the spatial footprint of a cell have highly similar signals. We use the HNC model [18, 19] to identify the clusters of pixels in a graph. This graph is an undirected graph $G=(V,E)$ where the node set $V$ represents the pixels and the edges in set $E$ represent similarity relations between pixels. With each edge $[i,j]\in E$ , we associate a similarity weight $w_{ij}\in[0,1]$ where $w_{ij}$ is close to $1$ if pixels $i,j$ are highly similar and close to [math] if they are not.

The goal in the HNC model is to find a partition of the nodes (pixels) into the set $S$ , which represents the spatial footprint of a single cell, and the set $\bar{S}=V\setminus S$ , which are the remaining pixels, so that the similarity between pixels in $S$ and the pixels in $\bar{S}$ is as low as possible, low inter-similarity, and the pixels in the spatial footprint $S$ are highly similar to each other, high intra-similarity. This is expresses as the following optimization problem:

[TABLE]

The parameter $\lambda\geq 0$ determines the tradeoff between the inter-similarity and the intra-similarity. Note that minimizing the negative of the intra-similarity term is equivalent to maximizing it.

HNC is closely related to a well-known optimization problem named Normalized Cut (NC) from the field of image segmentation [26, 27]. NC has been previously used for calcium imaging movies to find regions that contain one or more cells by Kaifosh et al. [17]. NC is not an appropriate model for the cell identification problem because it would try to assure that the remaining pixels (i.e. those in $\bar{S}$ ) are also highly similar. Furthermore, NC is known to be NP-hard which means that it is computationally intractable. In practice, it is solved with a heuristic based on eigenvectors [26].

In contrast, very efficient algorithms exist to optimally solve the HNC problem. These algorithms [18, 19] arise from reducing the HNC problem to the well-known minimum cut/maximum flow problem. These algorithms are especially powerful in that they guarantee to find an optimal set $S$ . This is a major advantage since it creates a unique and transparent mapping from input to output. As a result, it is straightforward to understand how changing the input affects the output of the algorithm. Furthermore, we can use parametric cut/flow algorithms for the minimum cut problem [28, 29] that find the optimal solutions for all values of $\lambda$ at once, removing the need for tuning the $\lambda$ parameter.

Seed and window selection

At each iteration the goal is to segment a single cell not previously identified. To ensure that the model identifies the correct cell, HNC needs as input a small set of representative pixels, seeds, for $S$ and $\bar{S}$ . For $S$ we select as seed a superpixel, a small square of $k\times k$ pixels where $k\leq 5$ . Note that when $k=1$ a superpixel is a singleton pixel. We refer to the selected superpixel for $S$ as the seed.

For $\bar{S}$ we select a small set of pixels, referred to as the negative seeds. These pixels are picked uniformly from a circle of a sufficiently large radius centered at the seed, such that the cell to be identified, if exists, can be assumed to be contained in the circle.

For each cell to be segmented, only the pixels within a certain vicinity of the seed need to be considered. For that purpose a sufficiently large window is constructed around the seed. The window is an $n_{1}\times n_{2}$ rectangular subregion of the movie that contains the seed at its center. The HNC clustering problem is restricted to the graph defined on the pixels in this window.

To find all cells, one can simply try all possible (super)pixels of a given size as seeds. Since the procedure for identifying a single cell is computationally very efficient, this exhaustive search is possible. Nevertheless, this procedure performs a significant amount of unnecessary computation. For example, selecting two seeds close to each other is likely to give the same outcome. Hence, it is sufficient to consider at most a few pixels per small subregion of the movie. Furthermore, a good indication of whether the seed is part of a cell is to compute the average correlation between the seed and the surrounding neighborhood of e.g. $5\times 5$ pixels. In most cases, it would be sufficient to try only the $30-40$ percent of pixels with the highest average correlation as seeds to detect nearly all of the active cells. Finally, pixels that are part of a previously found cell should be excluded to prevent the same cell from being segmented multiple times.

Measuring similarity: Correlation space

A standard technique for measuring similarity between signals is to measure the correlation between them. However, this is not an appropriate measure for the similarity measure $w_{ij}$ . The problem with correlation is that pixels that are not part of any cell are not sufficiently correlated with each other. As a result, these pixels are as likely to be clustered in $S$ (the spatial footprint) as in $\bar{S}$ . To overcome this, we introduce correlation space.

In correlation space, a pixel is represented by a vector containing the correlations between the pixel’s intensity timeseries and the timeseries of all pixels in the window. That is, every pixel $i$ is described by a vector of correlations $R_{i}\in[-1,1]^{n}$ , where $n=n_{1}\times n_{2}$ is the number of pixels in the window. The pixels that are part of a spatial footprint will have high values for the entries that correspond to pixels that are in the spatial footprint and low values for the entries of pixels that are not in the spatial footprint. As a result, the pixels in each spatial footprint as well as the background pixels cluster in correlation space. See Figures 1 and 2 for visualizations of this phenomenon.

To define $w_{ij}$ we measure the Gaussian similarity between pixels $i$ and $j$ in correlation space. Precisely, we define $w_{ij}$ as:

[TABLE]

where $\sigma$ is a scaling parameter that is typically set to $1$ . Note that $w_{ij}$ is close to one when the distance between pixels $i$ and $j$ in correlation space is small and near zero when the distance is large.

Sparse Computation

For the edge set of the graph $G=(V,E)$ on which we solve the HNC model, one option is to naively select $E$ to consist of all possible pairs of pixels. However, this can be computationally costly and unnecessary since most pairs of pixels are highly dissimilar. Instead, we use a methodology called sparse computation [22, 23]. Sparse computation allows us to compute only a small subset of all possible pairs. It uses the observation that for many pairs the pairwise similarity is close to zero and that removing these edges does typically not affect the outcome. For general machine learning, sparse computation significantly reduces the running time of similarity-based classifiers [22, 23], such as SNC [30], KNN, and kernel SVM. In sparse computation, the data, here consisting of the correlation vectors, is efficiently projected onto a low-dimensional space of dimension $p$ , typically using a fast approximation of principal component analysis. The low-dimensional space is then subdivided into $\kappa$ sections per dimension resulting in $\kappa^{p}$ grid blocks. We add to the edge set $E$ edges between pixels in the same or neighboring grid blocks.

HNCcorr algorithm

Having introduced the key ideas behind HNCcorr, we now describe the HNCcorr algorithm.

The HNCcorr algorithm iterates through a list of seeds. For each seed, the algorithm performs the steps described below to identify a cell whose spatial footprint contains the seed or to conclude that no such cell exists. The algorithm first constructs a window around the seed and picks the negative seeds. Next, the algorithm maps each pixel to correlation space. It then determines which edges are in the graph with the sparse computation technique. For each of these edges, $w_{ij}$ is computed. Now that the graph is well-defined, the algorithm calls a subroutine to solve the corresponding HNC problem for all $\lambda\geq 0$ . We can obtain an optimal set $S(\lambda)$ for each value of $\lambda$ , since the sets $S(\lambda)$ are nested ( $\lambda_{1}<\lambda_{2}$ , $S(\lambda_{1})\subseteq S(\lambda_{2})$ ) and therefore there can be at most $n$ such sets.

Finally, a simple oracle decides the best spatial footprint for the cell based on these sets $S(\lambda)$ or concludes that there is no cell located at the seed. Currently, the oracle simply checks if the number of segmented pixels is between a given lower and upper bound based on the expected size of a cell. In case none of the segmentations satisfy this criteria, then the oracle concludes that no cell containing the input seed exists. In case more than one segmentation satisfies the size criteria, the oracle returns the spatial footprint where the number of segmented pixels is closest to a given expected cell size. Note that more complex oracles can be used as well.

The output of the algorithm is either the best segmentation or an empty set indicating that no cell has been identified. This process repeats for each of the selected seeds.

A visualization of a segmentation for a given seed is given in Figure 3. Figure 4 shows an example of the output of the HNCcorr algorithm after processing all seeds in a calcium imaging movie.

Experimental Results

To evaluate the experimental performance of HNCcorr in detecting cells in calcium imaging movies, we compare against the best known algorithms to date. Specifically, we test HNCcorr as well as the matrix factorization algorithms Suite2P [16] and CNMF [15] on the datasets of the Neurofinder public benchmark [24]. The Neurofinder benchmark is an initiative of the CodeNeuro collective of neuroscientists that encourages software tool development for neuroscience research. Based on the Neurofinder benchmark, we show that HNCcorr outperforms the other algorithms with a relative improvement in average F1-score across datasets of at least 14 percent.

The Neurofinder benchmark consists of nine testing datasets and nineteen training datasets. Each dataset consists of a calcium imaging movie. For the training datasets, a reference list of the spatial footprints of the cells in the movie, is provided with the dataset. For the testing datasets, the reference list (i.e. the spatial footprints of the cells) is not disclosed. The datasets have been contributed by various labs and have been recorded under different experimental conditions. They differ in sample frequency, length of the movie, magnification, signal-to-noise ratio, region of the brain that was recorded, and how the footprints in the datasets were determined. Table 2 contains the characteristics of the testing datasets.

The datasets can be split into two groups based on the activity of the cells. The reference list for the 00 and 03 dataset series contains many cells that have a weak or non-existent signal, whereas most cells have a detectable signal in the 01, 02, and 04 dataset series.As a result, algorithms based on signal detection are expected to perform poorly on the 00 and 03 dataset series.

To evaluate an algorithm on the Neurofinder benchmark, the algorithm should identify the cells in each testing dataset. The spatial footprints of the cells as detected are then submitted to Neurofinder. The submission is then automatically evaluated by comparing it against the reference list of spatial footprints provided by the dataset contributors. Since these lists are not publicly disclosed for the testing datasets, this guarantees an unbiased and fair evaluation. Furthermore, this procedure removes the need for replicating the results of other algorithms.

A submission to Neurofinder is evaluated for each testing dataset based on metrics for detection quality and metrics for the segmentation quality111See the supplementary material for results on detection quality.. To measure the detection quality of the algorithm we consider the following metrics, [24]:

Recall

The percentage of the cells in the undisclosed reference list of spatial footprints that are recovered by the algorithm.

Precision

The percentage of the cells identified by the algorithm that are also present in the undisclosed reference list.

F1-score

The harmonic mean of precision and recall. This is a standard metric in machine learning for evaluating the overall performance of an algorithm.

For the recall and precision metrics, a cell is considered identified if the center of mass of the spatial footprint as determined by the algorithm is within five pixels of the center of mass of the spatial footprint in the reference list.

The experimental performance of the signal-based algorithms HNCcorr, CNMF, and Suite2P on the 01, 02, 04 dataset series is shown in Figure 5. The implementation details for the HNCcorr submission are described in the supplementary material. Since the 00 and 03 dataset series are not appropriate for signal-based algorithms, these dataset series have been excluded. Results on all testing datasets are provided in the supplementary material.

Overall, the HNCcorr algorithm has superior performance across the datasets. The HNCcorr achieves a 14 percent relative improvement in average F1-score compared to the Suite2P and CNMF algorithm. The HNCcorr algorithm also achieves the highest F1-score for each of the datasets with the exception of the dataset 04.00. This improvement is mainly due to an increase in precision across all datasets except 04.00. All of the algorithms have a significantly lower precision for the 04.00 dataset as compared to the other datasets. Based on analysis of the 04.00 training dataset, a possible explanation is that some cells with signal are identified by the algorithms but these cells are not in the reference listing. Furthermore, the HNCcorr algorithm achieves a significant improvement in recall for the 02.00 and 02.01 datasets, resulting in near-perfect recall.

A common strategy for the Neurofinder benchmark is to submit an ensemble of algorithms to mitigate the poor performance of the signal-based algorithms on the 00 and 03 dataset series. Typically, the ensemble consists of a shape-based detection for the 00 and 03 series and a signal-based algorithm for the 01, 02, and 04 dataset series. In Figure 6, we compare the results of HNCcorr combined with conv2d with two other ensembles of algorithms based on Suite2P. These three submissions are leading on the Neurofinder leaderboard as of January 2017, and the ensemble of the HNCcorr algorithm and the conv2d shape-based algorithm achieves the highest score for the Neurofinder benchmark.

Discussion

The algorithm HNCcorr is a new algorithm for cell identification in calcium imaging movies. In contrast to previous methods, HNCcorr is not based on matrix factorization. Instead, the HNCcorr algorithm combines graph clustering based on combinatorial optimization with the use of correlations. Here we discuss several characteristics of HNCcorr.

First, the distinguishing feature of HNCcorr is that its underlying optimization model, the HNC model, can be solved optimally with a combinatorial optimization algorithm. To the best of our knowledge, all other techniques rely on heuristics for finding a solution to their model. Solving a model optimally is important because the solution is fully determined by the model and does not depend on the solution technique. This provides a transparent mapping from the model and the input data to the outcome. As a result, unsatisfactory results can therefore be analyzed solely based on the model, without having to consider any uncertainty introduced by the solution technique.

Second, the HNCcorr algorithm has a different computational structure. The key difference compared to other algorithms is that the algorithm detects the cell bodies of the neurons one at a time. An immediate advantage is that multiple seeds can be evaluated simultaneously. This allows for efficient parallelization of the algorithm. Another characteristic of HNCcorr is that it has an oracle that filters segmentations that are not cells. This is used since HNCcorr attempts to identify a cell based on any seed. As demonstrated, a simple oracle based on the size of the segmentation is sufficient to attain state-of-the-art precision. We believe that a more advanced oracle could significantly improve the precision of the algorithm, and we plan to investigate this further. Another observation is that the HNCcorr algorithm scales linearly in the length of the movie whereas matrix factorization algorithms do not.

A critical parameter for the HNCcorr algorithm is the size of the window. The window size sets the dimension of the correlation space and determines the computational cost of mapping the pixels to correlation space. If this becomes a significant computational issue, then we plan to explore the following ideas: One is to limit the computation of correlation only to the relevant similarities and set the remaining correlations to zero. For example, one could use an unbiased estimator for the correlation based on a random subset of the frames and compute only those correlation that are significantly different from zero. Another option is to limit the dimension of the correlation space by computing only the correlations with respect to a (possibly random) subset of the pixels.

An often stated advantage of matrix factorization based algorithms is that these algorithms can identify overlapping cells based on their signal. Although the percentage of cells that overlap is typically small, the HNCcorr algorithm is also able to identify overlapping cells, see Figure 7 for an example. The key observation here is that pixels can be segmented more than once and, thus, belong to multiple cells.

Finally, a novel feature of HNCcorr is the use of correlation space. Through the mapping to correlation space, HNCcorr is the first algorithm that employs correlation to attain state-of-the-art results. As previously discussed in Figures 1 and 2, the success of employing correlation space is that closeness in this space is a strong indicator of whether pixels correspond to the same cells or background.

Conclusion

In this paper we present HNCcorr for identifying cells in calcium imaging movies. HNCcorr segments cells one at a time by finding distinct sets of pixels that are nearly identical in correlation space. The algorithm is based on a combinatorial optimization problem known as HNC that can be solved efficiently. As a result, the algorithm guarantees to find an optimal solution to the underlying optimization problem. Experimentally, HNCcorr has superior performance compared to matrix factorization based algorithms that have been considered state-of-the-art. For cells segmented based on activity, HNCcorr achieves a relative improvement of at least 14 percent in average F1-score compared to CNMF and Suite2P. Combined with the Conv2d shape-based detection algorithm, HNCcorr achieves the best known results to date in the Neurofinder benchmark.

For further research, we plan to adapt HNCcorr to calcium imaging movies collected with light-field microscopy [31] as well as for the detection of subcellular components (e.g. dendrites). We also plan to explore a more powerful oracle that decides whether a segmentation corresponds to a cell which could potentially lead to a significant improvement in the performance of the algorithm. Finally, we plan to investigate other domains where HNCcorr could be of value.

Acknowledgments

The authors would like to thank Christopher Harvey, Selmaan Chettih, and Matthias Minderer from Harvard Medical School for their guidance, the data they provided, and for the valuable discussions we have had.

Supplementary Material

Implementation details

HNCcorr was implemented in Matlab R2015a. Here we describe the relevant implementation details.

A new Python version of the code has been made available on GitHub: https://github.com/quic0/HNCcorr

Preprocessing:

The datasets were preprocessed by averaging every ten frames into a single frame to reduce the noise. Each preprocessed dataset is stored in HDF5 format for efficient data access.

Seed selection:

To identify possibly effective seeds, we use the procedure described in Algorithm 1.

Parameters settings:

For sparse computation we use three dimensions ( $p=3$ ) for the low-dimensional space and a grid resolution of $\kappa=25$ . For the negative seeds, we select a set of nine pixels uniformly distributed from a large circle.

The remaining parameters depend on the dataset and are defined as follows:

[TABLE]

Neurofinder Benchmark results

The tables below show the performance of the algorithms on all datasets for both the cell detection and segmentation quality metrics. The metrics for cell detection have been defined in the experimental results section. To measure the segmentation quality, the Neurofinder benchmark uses the metrics inclusion and exclusion [24].

Metrics segmentation quality

Let $A$ be the set of pixels in the spatial footprint of a cell for the reference labeling and $B$ be the set of pixels of the spatial footprint of a cell for the algorithm. Then, the metrics inclusion and exclusion are defined as follows:

Inclusion

Average of $\frac{\left|A\cap B\right|}{\left|A\right|}$ across all cells correctly identified by the algorithm.

Exclusion

Average of $\frac{\left|A\cap B\right|}{\left|B\right|}$ across all cells correctly identified by the algorithm.

These metrics are of lower importance since the critical task is finding the location of a cell. It may still be possible to extract the signal of the cell even if the shape is not detected perfectly. Also, the spatial footprint of the cell in the reference labeling is not always fine-tuned.

Results

For each of the listed metrics, higher scores are better.

Bibliography31

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Hamel, E. J., Grewe, B. F., Parker, J. G. & Schnitzer, M. J. Cellular level brain imaging in behaving mammals: an engineering approach. Neuron 86 , 140–159 (2015).
2[2] Scanziani, M. & Häusser, M. Electrophysiology in the age of light. Nature 461 , 930–939 (2009).
3[3] Peron, S., Chen, T.-W. & Svoboda, K. Comprehensive imaging of cortical networks. Current opinion in neurobiology 32 , 115–123 (2015).
4[4] Vogelstein, J. T. et al. Fast nonnegative deconvolution for spike train inference from population calcium imaging. Journal of neurophysiology 104 , 3691–3704 (2010).
5[5] Grewe, B. F., Langer, D., Kasper, H., Kampa, B. M. & Helmchen, F. High-speed in vivo calcium imaging reveals neuronal network activity with near-millisecond precision. Nature methods 7 , 399–405 (2010).
6[6] Theis, L. et al. Supervised learning sets benchmark for robust spike detection from calcium imaging signals. Ar Xiv e-prints (2015). 1503.00135 .
7[7] Pnevmatikakis, E. A., Merel, J., Pakman, A. & Paninski, L. Bayesian spike inference from calcium imaging data. In 2013 Asilomar Conference on Signals, Systems and Computers , 349–353 (IEEE, 2013).
8[8] Pachitariu, M. et al. Extracting regions of interest from biological images with convolutional sparse block coding. In Advances in Neural Information Processing Systems , 1745–1753 (2013).