Instance-Level Microtubule Tracking

Samira Masoudi; Afsaneh Razi; Cameron H.G. Wright; Jay C. Gatlin; Ulas Bagci

arXiv:1901.06006·cs.CV·January 22, 2020

TL;DR

This paper introduces a deep learning method for precise instance-level microtubule tracking in time-lapse images, improving velocity estimation and reducing false negatives by leveraging recurrent attention and temporal data.

Contribution

The novel approach combines segmentation, trajectory assignment, and velocity estimation using recurrent attention, significantly enhancing microtubule tracking accuracy over previous methods.

Findings

01 Precision of MT instance velocity estimation improved from 29.3% (baseline) to 71.3%.

02 False negative rate reduced from 67.8% (baseline) to 28.7%.

03 Method validated on real and simulated data.

Tables

TABLE I: Definitions of terms used in computing the loss function

$I_{t}$	Current frame
$I_{t}^{{AR}_{i}}$	Predicted mask for attention region of instance $i$
$I_{t}^{i}$	Predicted mask of segmented instance $i$
$S^{i}$	Obtained score for segmenting instance $i$
$Y_{t}$	3-D matrix of binary masks at time $t$
$Y_{t}^{{AR}_{j}}$	Binary mask for attention region of instance $j$
$Y_{t}^{j}$	Binary mask of instance $j$
${S^{j}}^{*}$	True score

TABLE II: Instance-level MT segmentation performance in terms of best Jaccard similarity coefficient (J), false positive rate (FPR), and false negative rate (FNR)

Method	Dataset	J	FPR	FNR
Adaptive Template Matching [10]	Real	0.533	0.596	0.415
PMM Kalman Smoother [10]	Real	0.474	0.343	0.285
Ours (OF, L=5)	Real	0.681	0.455	0.219

TABLE III: Instance-level MT velocity estimation performance in terms of best Vsim (BVs), false discovery rate (FDR), false negative rate (FNR), and difference in counting (DiC).

Method	Dataset	BVs	FDR	FNR	DiC	$DiC_{ext}$	$DiC_{ent}$
Adaptive Template Matching [10]	Real	0.293	0.762	0.678	0.431	0.624	0.829
PMM Kalman Smoother [10]	Real	0.568	0.372	0.391	0.512	0.438	0.329
Ours (OF, L=5)	Real	0.632	0.237	0.287	0.116	0.363	0.313

TABLE IV: Instance-level MT velocity estimation using raw frames (threshold = 0.23, L = temporal window length).

Window	Dataset	BVs	FDR	FNR	DiC	$DiC_{ext}$	$DiC_{ent}$
L=1	Sim	0.320	0.420	0.230	0.356	0.611	0.278
L=3	Sim	0.465	0.120	0.313	0.273	0.454	0.438
L=5	Sim	0.665	0.001	0.149	0.118	0.413	0.404
L=1	Real	0.249	0.215	0.320	0.563	0.682	0.439
L=3	Real	0.544	0.327	0.319	0.542	0.512	0.418
L=5	Real	0.583	0.321	0.314	0.532	0.463	0.374

TABLE V: Instance-level MT velocity estimation using OF (threshold = 0.23, L = temporal window length).

Window	Dataset	BVs	FDR	FNR	DiC	$DiC_{ext}$	$DiC_{ent}$
L=3	Sim	0.705	0.023	0.156	0.086	0.319	0.237
L=5	Sim	0.712	0.001	0.083	0.081	0.211	0.186
L=3	Real	0.588	0.268	0.151	0.171	0.390	0.349
L=5	Real	0.632	0.237	0.287	0.116	0.363	0.313

TABLE VI: Instance-level MT velocity estimation using OF, for different potential architectures of the visual attention module (threshold = 0.23, L = 5).

Architecture	Dataset	BVs	FDR	FNR	DiC	$DiC_{ext}$	$DiC_{ent}$
CNN1 (10-layers)	Sim	0.556	0.159	0.318	0.217	0.511	0.526
CNN1 (15-layers)	Sim	0.564	0.154	0.303	0.221	0.513	0.515
CNN (8-layers)+LSTM	Sim	0.712	0.001	0.083	0.081	0.211	0.186
CNN1 (10-layers)	Real	0.412	0.347	0.452	0.560	0.487	0.432
CNN1 (15-layers)	Real	0.418	0.344	0.439	0.551	0.479	0.415
CNN (8-layers)+LSTM	Real	0.632	0.237	0.287	0.116	0.363	0.313

TABLE VII: Average performance of instance-level MT segmentation over 50 frames simulated to contain 30 MT instances.

Method	BJc	FPR	FNR
Recurrent Instance Segmentation [27]	0.651	0.097	0.311
Deep Watershed Transform for Instance Segmentation [26]	0.514	0.056	0.402
Ours (OF, L=5)	0.743	0.028	0.069

TABLE VIII: Average performance of instance-level MT segmentation (OF, L=5) over 50 frames simulated to contain a given number of MT instances.

MT number	BJc	FPR	FNR
10	0.825	0.012	0.046
20	0.796	0.018	0.065
30	0.743	0.028	0.069
40	0.711	0.055	0.179

Equations

\mathbf{G}_{t}^{k} = \{\mathbf{I}_{t-L}, \dots, \mathbf{I}_{t}, \dots, \mathbf{I}_{t+L}, \mathbf{C}_{t}^{k}\}.

\mathbf{A}^{u} = \begin{cases} 1/(H_{2} \times W_{2}), & \text{if } u = 0, \\ \text{MLP}(\mathbf{z}^{u}), & \text{otherwise,} \end{cases}

\mathbf{z}^{u} = \begin{cases} 0, & \text{if } u = 0, \\ \text{LSTM}\Big(\mathbf{z}^{u-1}, \sum_{h_{2}, w_{2}} \mathbf{A}^{u-1}(h_{2}, w_{2})\, \mathbf{q}_{h_{2}, w_{2}}\Big), & \text{otherwise.} \end{cases}

[\mu_{x}, \mu_{y}, \sigma_{x}, \sigma_{y}]^{\top} = \mathbf{W}_{b} \mathbf{z}^{U} + \mathbf{w}_{b0}.

\mathbf{F}_{x}(h_{1}, h_{3}) = \frac{1}{\sigma_{x} \sqrt{2\pi}} \exp\Big(-\frac{(h_{1} - \mu_{x})^{2}}{2\sigma_{x}^{2}}\Big),

\mathbf{F}_{y}(w_{1}, w_{3}) = \frac{1}{\sigma_{y} \sqrt{2\pi}} \exp\Big(-\frac{(w_{1} - \mu_{y})^{2}}{2\sigma_{y}^{2}}\Big).

\mathbf{P} = \mathbf{F}_{x}^{\top} \mathbf{G}_{t}^{k} \mathbf{F}_{y}.

\hat{\mathbf{P}} = \text{Encoder}(\mathbf{P}),

\mathbf{I}_{t}^{k}

\mathbf{C}_{t}^{k} = \begin{cases} 0, & \text{if } k = 1, \\ \frac{1}{k-1} \sum_{j=1}^{k-1} \mathbf{I}_{t}^{j}, & \text{if } k > 1. \end{cases}

M = \max_{1 \leq t \leq T} (n_{t}),

f(\mathbf{A}, \mathbf{B}) = \frac{\sum (\mathbf{A} \circ \mathbf{B})}{\sum (\mathbf{A} + \mathbf{B} - \mathbf{A} \circ \mathbf{B})},

L = L_{att} + L_{seg} + L_{count}.

L_{att}(\mathbf{I}_{t}, \mathbf{Y}_{t}) = -\frac{1}{m_{t}} \sum_{i,j} l_{att}^{i,j},

l_{att}^{i,j} = \begin{cases} f(\mathbf{I}_{t}^{AR_{i}}, \mathbf{Y}_{t}^{AR_{j}}), & \text{if } i \text{ matches } j, \\ 0, & \text{otherwise.} \end{cases}

L_{seg}(\mathbf{I}_{t}, \mathbf{Y}_{t}) = -\frac{1}{m_{t}} \sum_{i,j} l_{seg}^{i,j},

l_{seg}^{i,j} = \begin{cases} f(\mathbf{I}_{t}^{i}, \mathbf{Y}_{t}^{j}), & \text{if } i \text{ matches } j, \\ 0, & \text{otherwise.} \end{cases}

l_{count}(\mathbf{I}_{t}, \mathbf{Y}_{t}) =

-(1 - {s^{i}}^{*}) \log\Big(1 - \max_{i \leq u} (s^{i})\Big)

J_{t}^{k} = \max_{j} \big(f(\mathbf{I}_{t}^{k}, \mathbf{Y}_{t}^{j})\big).

\boldsymbol{\delta} = [x_{t} \; y_{t} \; x_{t+1} \; y_{t+1}]^{\top}.

Vsim(\boldsymbol{\delta}, \boldsymbol{\delta}_{GT}) = \frac{\boldsymbol{\delta} \cdot \boldsymbol{\delta}_{GT}}{|\boldsymbol{\delta}|^{2} + |\boldsymbol{\delta}_{GT}|^{2}} + \frac{1}{2}.

\text{BVs}_{t}^{i} = \max_{\boldsymbol{\delta}_{GT}} \big(Vsim(\boldsymbol{\delta}^{i}, \boldsymbol{\delta}_{GT})\big), \quad i \in \{1, \dots, m_{t,t+1}\}.

|DiC|_{trans} = \frac{1}{T} \sum_{t} \frac{|m_{t,t+1} - n_{t,t+1}|}{n_{t,t+1}},

|DiC|_{ext} = \frac{1}{T} \sum_{t} \frac{|m_{t,ext} - n_{t,ext}|}{n_{t,ext}},

|DiC|_{ent} = \frac{1}{T} \sum_{t} \frac{|m_{t,ent} - n_{t,ent}|}{n_{t,ent}}.
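
As a worked illustration of the evaluation formulas in this list, the following Python sketch computes the overlap score $f(A, B)$, the best Jaccard coefficient, Vsim, and a DiC average. It is a minimal sketch rather than the authors' implementation: masks are assumed to be equal-sized nested lists of 0/1 values, the $\delta$ vectors are plain 4-tuples, and all function names are hypothetical.

```python
def jaccard(a, b):
    """f(A, B) = sum(A∘B) / sum(A + B − A∘B) for two equal-sized binary masks."""
    pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    inter = sum(x * y for x, y in pairs)
    union = sum(x + y - x * y for x, y in pairs)
    return inter / union if union else 0.0  # assumption: two empty masks score 0


def best_jaccard(pred, truths):
    """J = max_j f(I^k, Y^j): best overlap of one prediction with any ground truth."""
    return max(jaccard(pred, y) for y in truths)


def vsim(delta, delta_gt):
    """Vsim = (δ·δ_GT) / (|δ|² + |δ_GT|²) + 1/2; assumes the vectors are not both zero."""
    dot = sum(p * q for p, q in zip(delta, delta_gt))
    norms = sum(p * p for p in delta) + sum(q * q for q in delta_gt)
    return dot / norms + 0.5


def dic(pred_counts, true_counts):
    """Mean over frames of |m_t − n_t| / n_t; assumes every n_t is positive."""
    terms = [abs(m - n) / n for m, n in zip(pred_counts, true_counts)]
    return sum(terms) / len(terms)
```

With this definition, Vsim equals 1 for identical vectors and 0 for exactly opposite ones, so 0.5 corresponds to orthogonal vectors.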

Full text

Instance-Level Microtubule Tracking

Samira Masoudi$^{1,2}$, Afsaneh Razi$^{1}$, Cameron H.G. Wright$^{2}$, Jesse C. Gatlin$^{2}$, Ulas Bagci$^{1}$

$^{1}$ Masoudi, Razi, and Bagci are with University of Central Florida, Orlando, FL 32816. $^{2}$ Masoudi, Wright, and Gatlin are with University of Wyoming, Laramie, WY 82071.

Abstract

We propose a new method of instance-level microtubule (MT) tracking in time-lapse image series using recurrent attention. Our novel deep learning algorithm segments individual MTs at each frame. Segmentation results from successive frames are used to assign correspondences among MTs. This ultimately generates a distinct path trajectory for each MT through the frames. Based on these trajectories, we estimate MT velocities. To validate our proposed technique, we conduct experiments using real and simulated data. We use statistics derived from real time-lapse series of MT gliding assays to simulate realistic MT time-lapse image series in our simulated data. This data set is employed as pre-training and hyperparameter optimization for our network before training on the real data. Our experimental results show that the proposed supervised learning algorithm improves the precision for MT instance velocity estimation drastically to 71.3% from the baseline result (29.3%). We also demonstrate how the inclusion of temporal information into our deep network can reduce the false negative rates from 67.8% (baseline) down to 28.7% (proposed). Our findings in this work are expected to help biologists characterize the spatial arrangement of MTs, specifically the effects of MT-MT interactions.

Index Terms:

Microtubules, TIRF microscopy, instance-level segmentation, instance-level sub-cellular tracking, microtubule-microtubule interaction.

I Introduction

Microtubules (MTs) are cytoskeletal polymers within eukaryotic cells composed of individual $\alpha$- and $\beta$-tubulin subunits arranged head to tail. The inherent asymmetry of the heterodimeric subunits produces polar filaments with two distinct ends (one termed “plus” for the dynamic end and the other “minus” for the more stable end) [1, 2]. The polar structure of MTs and their highly regulated growth dynamics make them good candidates for many tasks vital to the maintenance of cell homeostasis, including intracellular transport, cell migration, asymmetric polarization, and cell division [2]. MTs are the primary components of the mitotic spindle, which is assembled by the cell to segregate sister chromatids during cell division. Considering their fundamental roles in myriad cellular processes, it is not surprising that perturbations of MT function can lead to diseases ranging from cancer to neurodegenerative disorders [3]. Therefore, a quantitative analysis of MTs is important for understanding the mechanistic underpinnings of many diseases at the molecular level.

In this study, we characterize two distinct types of MT dynamics:

1. Individual MT growth dynamics: each single MT (with any mobility status: mobile, immobile, others) undergoes several stochastic transitions between growth (subunit attachment), pause, and shrinkage (subunit detachment) at its plus end [4]. This is called dynamic instability.

2. Interactions between MT instances: motor-dependent movement causes MTs to interact with their surroundings and particularly with other MTs. This interaction occurs through direct contact and/or by crosslinking via specific motor and non-motor proteins known as microtubule associated proteins (MAPs) [2, 5].

While various investigations have focused on dynamic instability by tracking MT plus-ends only [2], the second type of dynamics, i.e., changes in MT behaviour due to interactions with motors, other proteins, and MTs, still requires additional investigation. MTs are arranged in space by motor-dependent crosslinking [6]. The resultant movement is thought to depend on the motor type and density as well as the number and nature of static non-motile MAPs [6]. Through sliding-filament mechanisms, the combined actions of these active and static MAPs spatially organize MTs relative to each other [7, 6]. Intracellular complexity often precludes informative in-vivo investigation of such sliding-filament mechanisms. Experimentalists can circumvent this limitation by employing reductionist in-vitro approaches termed MT-gliding assays [6]. In these assays, MTs are labeled and tracked as they are moved along a coverslip surface by surface-bound MT-dependent motors [6]. Although this class of assays has been used extensively [6], the inherent utility of the approach is limited by a general lack of available, objective, and automated tracking methods. Here we describe the development of a gliding assay analysis method that minimizes subjectivity by applying recurrent attention to identify and segment MTs.

To generate novel data for our analysis, we perform similar assays using MTs assembled in cell-free extracts derived from Xenopus laevis eggs [8]. In these assays, MTs are spiked with fluorescently labeled MAPs [9]. In addition, endogenous cytoplasmic motors in the extract bind to the coverslip surface nonspecifically to power the MT gliding. The extract also contains a large complement of non-motor MAPs that are thought to decorate MTs along their lengths and potentially affect MT-MT interactions via binding and/or crosslinking. The depth of the flow chamber used in these studies is $\sim$50 to 60 times greater than the 25 nm diameter of MTs, providing sufficient space to enable multiple MTs to freely slide over each other. Total internal reflection fluorescence (TIRF) microscopy is used to visualize MT movements and dynamics. Time-lapse image sequences are recorded from TIRF microscopy for our analyses.

Qualitative analysis of our image sequences indicates that sudden changes in MT velocity, in terms of direction or amplitude, often occur concurrently with obvious MT-MT collision and interaction. Such an event is depicted in Figure 1, where interaction among three MTs results in an obvious change in their velocities. MT velocity is defined as a motion vector with respect to the leading end of the MT (i.e., its head), disregarding its dynamic instability [10]. As can be seen in Figure 1, velocity changes are manifested as changes of either amplitude (MT3, blue vector in II and III), direction (MT1, green vector in III and IV), or both (MT1, green vector in II and III).

To characterize these changes, one must track individual MTs in sequential frames. However, the instance-level MT tracking problem is challenging for the following reasons:

• low diversity in MT appearance,

• time-varying nature of the features as a result of dynamic instability and photobleaching,

• abrupt appearance/disappearance of MTs (caused in part by the use of TIRF microscopy, which illuminates only the first 100 nm of depth from the coverslip surface), and

• unexpected changes in MT shape, ascribed to MT interaction and collision.

To address these challenges and the limitations inherent to the existing methods, we propose a generic solution composed of two distinct but complementary parts. The first part of our solution introduces a novel instance-level MT segmentation method (at each frame). The second part tracks MTs along time-lapse images by utilizing these instance segmentation results in a data-association framework. This research is a substantial step toward the development of an analysis platform that enables biologists to characterize the effect of MT-MT interactions on MT velocity.

I-A Related works

Early visualization of MTs started with time-lapse images captured from either cells injected with fluorescently-labeled tubulin or those expressing fluorescent protein-tubulin fusions [11, 12]. Application of this method was typically restricted to the periphery of interphase cells, where MT density was sufficiently low to permit high-contrast imaging of individual MTs. The in vivo exploration of MTs evolved considerably with the use of fluorescently labeled +TIPs, proteins that bind specifically to the growing MT plus ends [13]. Tracking the +TIPs revealed descriptive parameters of dynamic instability such as MT nucleation rate and growth speed [14, 15, 16]. The computational analyses for +TIP tracking are principally derived from multiple particle tracking algorithms, in contrast with a few solutions based on dense field motion detection [17].

Literature on multiple particle tracking implies two steps of (1) recognition of the relevant particles, and (2) associating the segmentation results [18, 15, 16]. The performance of each step directly affects the quality of the obtained spatiotemporal trajectories.

Literature on segmentation methods from the pre-deep learning era is vast: clustering, region growing, morphological filtering, template matching, wavelet decomposition, and graph and fuzzy set algorithms [19]. However, only a few of these methods have been applied to sub-cellular particle detection and MT segmentation (see the comprehensive review in [20]). Among such methods, the oldest is thresholding, which takes advantage of differences in fluorescent intensity between the objects being tracked and the background. Previously in [10], we employed a global threshold value via Otsu’s method to segment MTs. As argued in [18], global thresholding alone cannot deliver ideal segmentation in microscopy images where noisy background, poor image quality, and heterogeneous particles exist. Various pre-processing ideas have been developed to partially solve such difficulties. For instance, the authors of [16] applied a Gaussian band-pass (BP) filter before global thresholding. Similarly, [21] used Gaussian denoising and morphological operations followed by thresholding for +TIP segmentation. To avoid the shortcomings of global thresholding, [18] applied local thresholding, where local thresholds are the local maxima of the BP-filtered image. Thresholding was usually used to generate seed points that feed an additional algorithm for more precise delineation. Such algorithms in the literature include region growing [16] and watershed segmentation [18]. Despite these engineering efforts, over- and under-segmentation problems persisted. In another attempt, [18] employed a post-processing step of template matching similar to [16] to benefit from the shape of the desired objects in the cell. In a different line of research, [22, 23] used wavelet decomposition for object detection. Regardless of the specifics of the approach used, none of these methods can segment MTs in a sequence of time-lapse images for the ultimate purpose of tracking. This inability is due to the time-varying nature of the image intensities, caused by photobleaching, molecular-level processes, or unique sub-cellular dynamics.

Deep learning based instance-level segmentation has become a rapidly growing area of study in recent years. Popular methods such as [24, 25, 26, 27, 28] mostly perform simultaneous instance-level and semantic segmentation for both classification and segmentation. Conventionally, these strategies consist of a box proposal followed by parallel processing for classification and detailed segmentation. The rationale for using the bounding box is the strong coupling between segmentation and object detection: once the object is found, delineation can be performed within each box (detected object). Mask R-CNN [29] and its extensions [30] and [31] are among the recent works with great potential in this stream. However, training this type of algorithm demands a huge collection of labor-intensive annotations, which is a major drawback in biomedical applications.

Several data association approaches exist in the literature. These algorithms optimize the association cost among the results obtained from two frames [32, 33] or more (multiple succeeding frames [34, 35] or larger batches of frames with more complex graph pruning techniques [36, 37]). Several challenges emerge in assigning the segmented objects from individual frames to each other. Among these, the problem of low signal-to-noise ratio (SNR) was resolved by the application of probabilistic approaches [15]. Even in the presence of adequate SNR, attributing suddenly appearing/disappearing particles to their true trajectories was a real struggle. +TIP imaging exemplifies this issue: its inability to visualize MTs during the pause and shrinkage phases necessitates extra processing [16, 1, 18, 32]. To compensate computationally for the missing phases of MTs, an algorithm was proposed by [32]. The plusTipTracker software package [18] was designed based on this algorithm to trace MT plus ends in +TIP images. The heterogeneous growth patterns exhibited by +TIPs were yet another challenging aspect. Interacting multiple model filtering, piecewise-stationary motion modeling, and the piecewise-stationary multiple motion Kalman smoother are the latest studies that incorporate Bayesian prediction power to optimize assignment [36, 38, 37, 10]. A complete literature review of the most common data association techniques in particle tracking applications is given in [39]. It is known that false negative rates are far more problematic to data association than imperfect detection. By avoiding mis-detections (reducing the false negatives), there is no need to use sophisticated multi-frame linking techniques [39]. The most recent development in this area is the application of deep learning to the problem of data association in multiple particle tracking [40].

Sub-cellular particle tracking can potentially be addressed by dense motion detection. Optical flow (OF) is a basic feature for describing motion in a dense field [17]. There are several strategies for OF computation, some of which have been applied to microscopy images. [41] and [42] used OF for cell tracking and motion estimation of cellular structures. In this regard, the Horn and Schunck OF (HS-OF) computation method is a global approach based on two assumptions: gray value constancy and smooth flow of the intensity values. Later, [42] utilized additional constraints to extend HS-OF to a combined local-global method. Tracking solutions that rely solely on OF computation have many drawbacks: losing small moving structures due to the coarse-to-fine decomposition, over-smoothing motion discontinuities caused by variational optimization, and having difficulty in dealing with illumination changes [43]. To address these limitations, a patch-based two-step aggregation framework was proposed to estimate the motion patterns of cellular structures [43].

Research gap: Literature on MTs is particularly focused on descriptive parameters of dynamic instability. Our project presents a new problem: estimating the (translational) velocity of MTs during in-vitro gliding assays. We fulfill this task through a deep instance-level MT segmentation and associated tracking method. Individual MT velocity estimation demands identification of each single MT, which is performed using instance-based segmentation. Unlike other instance-level segmentation methods in computer vision applications, we focus on MTs only (foreground) to avoid category-dependent computations, an extensive number of parameters, and numerous costly annotations. We take advantage of attention modeling to improve segmentation results and allow separation of MTs. In contrast to [44, 45, 46], who used attention to get fine-grained details of a single instance, our study uses attention to exploit the spatial relation among different MT instances in an image. After identifying the attention region(s), the exact instance mask is thoroughly segmented.

II Methods

II-A Overview of the proposed method

Our method can be best described under two headings: Part 1) instance-level MT segmentation at each frame, and Part 2) MT association among successive frames. The first involves segmentation, which becomes extremely challenging when MTs overlap or collide. To alleviate this, we present a new instance-level segmentation algorithm utilizing a recurrent neural network (RNN). The segmentation procedure is guided by a novel visual attention module that repeatedly processes a single frame to segment its MT contents. This module facilitates efficient delineation of individual MTs even when MT-MT interactions exist. We describe our solution in five steps. Step 1 is the data preparation module at time $t$: the current frame, a sequence of its neighboring frames or their respective OFs, and a weighted sum of the already segmented instances at the current frame are grouped together as input. Step 2 describes the visual attention module, which proposes where to focus for segmentation. Step 3 includes the segmentation unit for each instance inside the area suggested by Step 2. Step 4 is a counter function validating Steps 2 and 3 to decide when to stop iterating over the same frame. In Step 5, the most recent segmented instance joins all the previously recognized instances from the same frame and the weighted sum is fed back to the input. The algorithm repeats the same procedure on the same frame until the counter in Step 4 signals to stop. At this point, we obtain instance-level segmentation results at the $t^{th}$ frame, and MT instances can be segmented in every single frame following this procedure. Later, in Part 2, we use the Hungarian algorithm [47] to assign the segmented MT instances across every pair of successive frames. As a consequence, we get trajectories of MTs along the frames, which enable MT velocity estimation. Figure 2 shows the flowchart of our segmentation platform.
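
The control flow of the five steps can be sketched in Python as follows. This is only an illustration of the loop structure under stated assumptions: the callables `attend`, `segment`, and `keep_going` are hypothetical stand-ins for the attention module, the segmentation unit, and the counter function, the iteration cap is an added safeguard, and the toy implementations at the bottom merely replay known masks so that the loop can run without a trained network.

```python
def mean_mask(masks):
    """Pixel-wise average of the masks found so far (the quantity fed back to the input)."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(m[r][c] for m in masks) / len(masks) for c in range(cols)]
            for r in range(rows)]


def segment_frame(frame_group, attend, segment, keep_going, max_instances=50):
    """Run Steps 1-5 on one frame until the counter signals to stop."""
    masks, canvas = [], None               # canvas=None: nothing segmented yet
    while len(masks) < max_instances:      # cap is an assumption of this sketch
        inputs = (frame_group, canvas)     # Step 1: prepare the input group
        region = attend(inputs)            # Step 2: propose where to look next
        mask = segment(inputs, region)     # Step 3: segment inside that region
        if not keep_going(region, mask):   # Step 4: counter decides whether to stop
            break
        masks.append(mask)                 # Step 5: add the new instance and
        canvas = mean_mask(masks)          #         feed the average back
    return masks


# Toy stand-ins (hypothetical): two known, non-overlapping instances.
TRUTH = [[[1, 1, 0], [0, 0, 0]],
         [[0, 0, 0], [0, 1, 1]]]


def toy_attend(inputs):
    """Return the index of the first instance not yet covered by the canvas."""
    _, canvas = inputs
    for idx, truth in enumerate(TRUTH):
        covered = canvas is not None and all(
            canvas[r][c] > 0
            for r in range(len(truth)) for c in range(len(truth[0])) if truth[r][c])
        if not covered:
            return idx
    return None


def toy_segment(inputs, region):
    return TRUTH[region] if region is not None else [[0, 0, 0], [0, 0, 0]]


def toy_keep_going(region, mask):
    return region is not None


found = segment_frame(frame_group=None, attend=toy_attend,
                      segment=toy_segment, keep_going=toy_keep_going)
```

In the toy run the loop visits the frame three times: two iterations each return one instance, and the third finds nothing left to attend to, so the counter stops the procedure.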

To the best of our knowledge, this is the first study exploring the problem of instance-level MT velocity estimation with a deep learning algorithm. Due to the limited and extremely heterogeneous nature of our real data, we first create a simulated data set based on statistics derived from the actual time-lapse microscopy images of MTs. Such simulated data provides a means to pre-train our deep learning framework and optimize its hyperparameters before fine-tuning on our limited real data.

II-B Problem statement

The problem throughout this paper is to estimate the translational velocity of each individual MT along the subsequent frames in a given set $\textit{{I}}=\{\textbf{I}_{1},\textbf{I}_{2},...,\textbf{I}_{T}\}$. While all these frames share the same dimensions, $\text{H}_{1}\times\text{W}_{1}\times 3$ (3 RGB channels), each may contain a different number of instances due to the sudden appearance/disappearance of MTs and their entrance and egress at the frame margins. The true number of MT instances at the $t^{th}$ frame is ${n}_{t}$, and these instances are denoted by binary ground truth masks $\{\textbf{Y}^{1}_{t},\dots,\textbf{Y}^{n}_{t}\}$.

As previously explained, our method is composed of two parts. For Part 1, we propose a configuration that sequentially goes through single frames from I to perform instance-level MT segmentation at each frame. As a result, we obtain ${m}_{t}$ binary masks of MT instances at the $t^{th}$ frame, represented by $\{\textbf{I}^{1}_{t},\dots,\textbf{I}^{m}_{t}\}$. These masks are compared against the ${n}_{t}$ binary ground truth masks to evaluate our segmentation performance at this frame. Once we have segmented all MTs in each frame along a sequence of successive frames, we move on to Part 2. For this part, we associate the results from each pair of successive frames $\textbf{I}_{t}$ and $\textbf{I}_{t+1}$ to recognize an individual path for each MT and estimate its velocity. We want our network to learn to segment instances with conflicting areas. To facilitate this, we append all binary ground truth instances through their third dimension and form a 3-D label tensor $\textbf{Y}_{t}$. Using a 3-D label while training enables our network to account for the overlapped area among individual instances at $\textbf{I}_{t}$.
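
The association step of Part 2 can be illustrated with a small Python sketch. Several choices here are assumptions made for illustration only: the cost between two masks is taken as one minus their Jaccard overlap, masks are represented as sets of pixel coordinates, both frames are assumed to contain the same number of instances, and the optimal assignment is found by exhaustive search over permutations. The exhaustive search returns the same minimum-cost matching that the Hungarian algorithm would, but it is only practical for a handful of instances.

```python
from itertools import permutations


def jaccard(a, b):
    """Overlap of two masks given as sets of (row, col) pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def associate(masks_t, masks_t1):
    """Return match[i] = index in frame t+1 assigned to instance i of frame t.

    Exhaustive search over permutations; assumes both frames hold the same
    number of instances (entering or leaving MTs are not handled here).
    """
    n = len(masks_t)
    if n != len(masks_t1):
        raise ValueError("this sketch needs equal instance counts")
    cost = [[1.0 - jaccard(a, b) for b in masks_t1] for a in masks_t]
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)


# Two MTs that each shift one pixel to the right; frame t+1 lists them in swapped order.
frame_t = [{(0, 0), (0, 1), (0, 2)}, {(3, 0), (3, 1)}]
frame_t1 = [{(3, 1), (3, 2)}, {(0, 1), (0, 2), (0, 3)}]
match = associate(frame_t, frame_t1)
```

Chaining such matches over consecutive frame pairs yields one trajectory per MT, from which a velocity can then be estimated.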

II-C Part 1: Instance-level MT segmentation in a single frame

Inspired by [48] and [49], we present a new system of attention to accurately segment individual MTs. The attention module generates particular Gaussian kernels to specify where to look for the next instance. These kernels blur out the exterior and enhance the interior of the attention area. Later, the segmentation network extracts MT(s) from the suggested region. Unlike [48], segmentation herein is an intermediate goal toward instance-level MT velocity estimation along time-lapse images. Additionally, MTs overlap considerably, hence there is a need for a specific type of instance segmentation, as demonstrated by Figure 3.

Our work improves on [48] in three major ways. First, we use 3-D labeling to secure a comprehensive segmentation in the case of overlapped instances: layers of ground truths $\textbf{Y}_{t}^{k}$ for all individual instances with potential common areas are appended through their third dimension and form a 3-D label tensor $\textbf{Y}_{t}$, as depicted by Figure 4.
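
To make the benefit of 3-D labeling concrete, the following Python sketch stacks two per-instance masks along a third axis. It is an illustration under simple assumptions (masks as nested lists of 0/1 values, a hypothetical helper named `stack_masks`), not the authors' data pipeline. A pixel where two MTs cross keeps a 1 in both layers, which a single 2-D label map with one instance id per pixel could not represent.

```python
def stack_masks(masks):
    """Stack per-instance binary masks along a third axis: label[r][c][j]."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[[m[r][c] for m in masks] for c in range(cols)] for r in range(rows)]


# Two crossing MTs on a 3x3 grid: one horizontal, one vertical.
horizontal = [[0, 0, 0],
              [1, 1, 1],
              [0, 0, 0]]
vertical = [[0, 1, 0],
            [0, 1, 0],
            [0, 1, 0]]

label = stack_masks([horizontal, vertical])

# Pixels claimed by more than one instance (the overlapped area).
shared = [(r, c) for r in range(3) for c in range(3) if sum(label[r][c]) > 1]
```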

Second, we extend [48] into the temporal domain to collect sufficient cues for segmenting the concealed areas. We utilize former and future frames to obtain location- and appearance-related evidence that supports our algorithm in segmenting overlapping instances. Finally, we improve the long short-term memory (LSTM) implementation from a fixed number of iterations in [48] to conditional convergence. Using a constant number of iterations can critically restrict the LSTM performance in proposing a new attention region.

To elaborate on our proposed strategy for instance-level MT segmentation at frame $\textbf{I}_{t}$, we assume that our network begins its $k^{th}$ iteration in an attempt to find the $k^{th}$ instance. The terms ${\textbf{G}_{t}}^{k}$ (defined in Equation 1) and ${\textbf{I}_{t}}^{k}$ respectively describe the input and output of our algorithm at time $t$ for instance $k$. With these assumptions, the overall structure of our network has the following elements:

II-C1 Input

To provide temporal information at the input, we use the frames in the neighborhood of the current frame. We also include ${\textbf{C}_{t}}^{k}$, a weighted average of all previously segmented instances at the current frame (see Equation 9 for details), to accommodate reasoning about a new instance. A sequence that contains $L$ former frames, the current frame, and $L$ future frames, alongside the most updated ${\textbf{C}_{t}}^{k}$, forms a tensor to supply the input group, $\textbf{G}^{k}_{t}$:

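$$\textbf{G}^{k}_{t}=\{\textbf{I}_{t-L},\dots,\textbf{I}_{t},\dots,\textbf{I}_{t+L},\textbf{C}^{k}_{t}\}.$$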
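
The construction of the input group can be sketched in Python as below. This is a minimal illustration with several assumptions that are not taken from the paper: frames are single-channel nested lists rather than RGB images, the group is a plain list rather than a tensor, windows that run past the ends of the sequence raise an error instead of being padded, and the helper names `canvas` and `input_group` are hypothetical. The canvas follows Equation 9: zero for the first instance, otherwise the mean of the previously segmented masks.

```python
def canvas(prev_masks, rows, cols):
    """C_t^k: all zeros for the first instance, else the mean of the k-1 earlier masks."""
    if not prev_masks:
        return [[0.0] * cols for _ in range(rows)]
    n = len(prev_masks)
    return [[sum(m[r][c] for m in prev_masks) / n for c in range(cols)]
            for r in range(rows)]


def input_group(frames, t, L, prev_masks):
    """G_t^k = {I_(t-L), ..., I_t, ..., I_(t+L), C_t^k} as a list of 2-D arrays."""
    if t - L < 0 or t + L >= len(frames):
        # Assumption of this sketch: no padding at the sequence borders.
        raise IndexError("temporal window exceeds the sequence")
    rows, cols = len(frames[t]), len(frames[t][0])
    return frames[t - L:t + L + 1] + [canvas(prev_masks, rows, cols)]


# Five tiny 1x2 frames; two instances were already segmented at frame t = 2.
frames = [[[float(i), float(i)]] for i in range(5)]
group = input_group(frames, t=2, L=1, prev_masks=[[[1, 0]], [[1, 1]]])
```

The group holds $2L+1$ frames plus one canvas, so its length is $2L+2$ for every instance $k$; only the last element changes between iterations on the same frame.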

Tensor $\textbf{G}^{k}_{t}$ is the input to both the visual attention and the segmentation block. We also examine another version of data preparation at the input, where we substitute the neighboring frames with their respective OFs to feed the visual attention. Using OF provides visual attention with indicative features of the motion vectors. For this purpose, we follow the work of Liu [50] to compute OF from each pair of successive frames within the neighborhood of $2L+1$ frames. The resulting set of OFs, $\boldsymbol{\Phi}_{t,\textit{L}}$, the current frame, and ${\textbf{C}_{t}}^{k}$ together form the input for visual attention. Experimental results for this trade-off are explained later in Section III.

We will henceforth drop the subscript $t$ and superscript $k$ for clarity. All subsequent terms that describe quantities inside the visual attention, segmentation, and counter function are assumed to be at time $t$ for instance $k$ .

II-C2 Visual attention

Our proposed attention network contains a convolutional neural network (CNN) followed by LSTM in the spatial-domain as depicted by Figure 5. The goal is to provide the region of interest for a constrained segmentation. First, the CNN passes the input volume through successive layers of convolutional filters and max pooling such that the $\text{H}_{1}\times\text{W}_{1}$ area of the input group is reduced to $\text{H}_{2}\times\text{W}_{2}$ non-overlapping tiles. Hence, the CNN generates a feature tensor Q of shape $\text{H}_{2}\times\text{W}_{2}\times\text{D}$ , where each spatial location is a ${D}$ -dimensional feature vector expressed by $\textbf{q}_{{h}_{2},{w}_{2}}$ , as illustrated in Figure 5.

Next, we utilize LSTM to model the spatial causality among multiple instances in a frame; i.e., it uses the spatial features of former instances: $\{1,...,k-1\}$ segmented at the current frame to estimate the area of attention for ${k}^{th}$ instance at the same (current) frame. Upon receipt of the feature tensor, the LSTM begins iterating to find those tiles in Q that contribute the most to the attention region. After any given ${u}^{th}$ iteration, LSTM produces a hidden state vector $\textbf{z}^{{u}}$ and a 2-D matrix, $\mathbf{A}^{{u}}$ of size $\text{H}_{2}\times\text{W}_{2}$ . Every entry in matrix $\mathbf{A}^{{u}}$ expresses the level of contribution to the attention region for its respective tile in Q. Initiating with equal involvement of all tiles, they gradually fine-tune (Eqs. 2 and 3):

[TABLE]

where MLP(.) in this equation denotes a multi-layer perceptron with a single hidden layer of 5 hidden units, and

[TABLE]

The LSTM repeats until each element in $\mathbf{A}^{{u}}$ converges or $u$ reaches a set maximum number. We refer to the last iteration number as U and use it as an upper index to specify the ultimate hidden state $\textbf{z}^{\text{U}}$ . Using a linear transformation of vector $\textbf{z}^{\text{U}}$ , we compute description parameters of the attention region as shown in Eq. 4. These parameters define mean and standard deviation of two Gaussian kernels $\textbf{F}_{x}$ and $\textbf{F}_{y}$ along the $x$ and $y$ axes:

[TABLE]
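The stopping rule described above (iterate until every element of $\mathbf{A}^{u}$ converges or $u$ reaches a set maximum) can be sketched as follows. This is a minimal illustration only: `step` is a hypothetical stand-in for one LSTM iteration, and the tolerance `tol` and cap `max_iters` are assumed values, not the ones used in the paper.

```python
def iterate_until_converged(step, a0, tol=1e-3, max_iters=50):
    """Repeat `step` until every entry of A changes by less than `tol`.

    step: callable mapping the current attention map A^u to A^{u+1}
          (hypothetical stand-in for one LSTM iteration)
    a0:   initial attention map as a list of rows (equal weights)
    Returns (A^U, U), where U is the number of iterations performed.
    """
    a = a0
    for u in range(1, max_iters + 1):
        a_next = step(a)
        # Largest element-wise change between two successive maps.
        change = max(abs(n - o) for rn, ro in zip(a_next, a) for n, o in zip(rn, ro))
        a = a_next
        if change < tol:
            break
    return a, u


# Toy update that pulls every weight halfway toward a fixed target map.
target = [[0.7, 0.1], [0.1, 0.1]]
step = lambda a: [[0.5 * (x + y) for x, y in zip(ra, rt)] for ra, rt in zip(a, target)]
a_final, U = iterate_until_converged(step, [[0.25, 0.25], [0.25, 0.25]])
```

With a fixed iteration count, the loop would either stop before the map has settled or waste iterations after it has; the conditional form avoids both.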

Gaussian kernels ( $\textbf{F}_{x}$ and $\textbf{F}_{y}$ ) are calculated using,

[TABLE]

Then, we transform the input $\textbf{G}^{k}_{t}$ into P (Eq. 7).

[TABLE]

Doing so, we fulfill two tasks simultaneously: first, intensifying the attention area, while attenuating the rest of the frame. Second, re-sampling the $\text{H}_{1}\times\text{W}_{1}$ area of input group into ${\text{H}_{3}}\times{\text{W}_{3}}$ for magnified details in P ( $\text{H}_{3}<\text{H}_{1}$ and $\text{W}_{3}<\text{W}_{1}$ ). In other words, pixels from the original current frame contribute to each pixel in the attention region according to matrices $\textbf{F}_{x}$ and $\textbf{F}_{y}$ .
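To make the role of $\textbf{F}_{x}$ and $\textbf{F}_{y}$ concrete, the sketch below builds two banks of row-normalized 1-D Gaussian filters and applies them to a single-channel image as $\textbf{F}_{y}\,\textbf{I}\,\textbf{F}_{x}^{\top}$. The exact parametrization is not reproduced in this section, so the form used here (filter centers spaced by `stride` around `center`, a shared `sigma`) is an assumption borrowed from common differentiable-attention formulations, and all function names are hypothetical.

```python
import math


def gaussian_kernel(n_out, n_in, center, stride, sigma):
    """Return an n_out x n_in matrix of row-normalized Gaussian filters."""
    rows = []
    for i in range(n_out):
        mu = center + (i - (n_out - 1) / 2.0) * stride  # centre of filter i
        w = [math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2)) for a in range(n_in)]
        z = sum(w) or 1.0  # guard against an all-zero row
        rows.append([v / z for v in w])
    return rows


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def extract_patch(image, f_y, f_x):
    """P = F_y . I . F_x^T for a single-channel image."""
    return matmul(matmul(f_y, image), transpose(f_x))


image = [[0.0] * 8 for _ in range(8)]
image[5][2] = 1.0  # one bright pixel at row 5, column 2
f_y = gaussian_kernel(3, 8, center=5, stride=1, sigma=0.5)
f_x = gaussian_kernel(3, 8, center=2, stride=1, sigma=0.5)
patch = extract_patch(image, f_y, f_x)  # 3x3 window centred on the bright pixel
```

Each output pixel is a weighted sum of input pixels, so the same two matrices both select the attention window and resample it to the patch size.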

II-C3 Segmentation

To segment an object in the attention region, we apply a back-to-back Encoder-Decoder similar to [51]. This design transforms our attention-magnified input group P into a $D^{\prime}-$ dimensional feature vector v first. Later, v is decoded into a pixel-wise prediction map $\hat{\textbf{P}}$ . To have the segmentation result in a comparable size with the original image, we undo the effect of the Gaussian kernels:

[TABLE]

where again all of these quantities are being computed at given time $t$ and instance $k$ .

II-C4 Counter function

A critical addition to this architecture is the counter function, which determines whether the attention region includes an instance or not. The counter function is made of a fully connected layer and a sigmoid function. This module takes in a concatenation of two vectors, $\textbf{z}^{\text{U}}$ (the most recently updated RNN hidden state) and v (the encoder output), to generate a score value $s^{k}$ (see Figure 6). We train the weights in the counter function so that the value of $s^{k}$ lies in the interval $[0,1]$ . A higher score expresses more certainty about the instance segmentation. As depicted in Figure 6, the counter function acts like a switch during the test. If $s^{k}$ surpasses a certain threshold, the network counts the segmented instance as a successful attempt and moves on to segmenting the next instance $k+1$ for the same frame $\textbf{I}_{t}$ . Otherwise, an unqualified instance segmentation forces the network to stop iterating through the same frame, reset itself, and step forward to the next frame, $\textbf{I}_{t+1}$ , along the time-lapse images.
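The switch-like behavior of the counter function at test time can be sketched as below. This is an illustration under stated assumptions: `counter_score` models the fully connected layer plus sigmoid with plain lists, `propose` is a hypothetical callable standing in for one attention, segmentation, and counter pass, and the safety cap `max_instances` is not part of the described method.

```python
import math


def counter_score(z, v, weights, bias):
    """Sigmoid of a fully connected layer over the concatenation [z, v]."""
    features = list(z) + list(v)
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def segment_frame(propose, threshold, max_instances=100):
    """Collect instances for one frame while the score stays above the threshold.

    propose: hypothetical callable k -> (mask, score) for instance k
    """
    accepted = []
    for k in range(max_instances):
        mask, score = propose(k)
        if score <= threshold:
            break  # stop, reset, and move on to the next frame
        accepted.append(mask)
    return accepted


scores = [0.9, 0.6, 0.1, 0.8]
propose = lambda k: ("mask%d" % k, scores[k])
kept = segment_frame(propose, threshold=0.23)
```

Note that the fourth proposal is never evaluated: once one score falls to the threshold or below, the frame is closed, which is why the training loss encourages confident objects to be selected first.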

II-C5 Feedback

Each segmented instance joins all previously segmented instances to constitute ${\textbf{C}_{t}}^{k}$ . Feeding this weighted average as part of the input set (Eq. 9) into our network facilitates the segmentation of future instances in two ways: it reduces the chance of selecting a region among already assigned areas, and it provides a prior to the network based on the potential relation between various instances:

[TABLE]

Training for segmentation: Our proposed instance-level segmentation algorithm closely ties the segmentation and attention networks to each other. However, the level of dependency differs between these two networks. The segmentation performance is directly determined by the attention accuracy, but the segmentation results only provide extra guidance to the attention network. Such coupling forces us to implement the training procedure in two stages. First, we ignore the segmentation network and train the attention network only. Second, we train the whole network by fine-tuning the attention weights and optimizing the segmentation network from scratch. At this stage, feeding back the premature segmentation results into the network can be misleading. Therefore, we define a “tuning-knob” parameter. This parameter enables us to feed the ground truth instance into the network and gradually replace it with the results from the segmentation network as the training progresses. Since the counter function must be trained to distinguish successful performance, in both training stages we force our algorithm to iterate $M$ times through each frame, where $M$ is determined from Eq. 10:

[TABLE]

where again ${n}_{t}$ is the true number of MT instances at frame $t$ and $T$ is the total number of frames. This choice of $M$ gives the counter function the opportunity to learn about acceptable vs. non-acceptable performance for instance-level segmentation. To account for the overlapped area at a single frame, we use a 3-D label tensor. Thus, our network learns about instances with conflicting areas that are either directly visible or hard to perceive due to occlusion. To compute the loss function we must evaluate the ground truth against our results. Since ground truth instances and segmentation results do not follow the same order, the Hungarian algorithm is chosen to optimally match results with labels, using a cost matrix (Figure 7).

We stipulate the cost function $f(\textbf{A},\textbf{B})$ as a measure of similarity between two binary masks A and B of the same size:

[TABLE]

where $\textbf{A}\circ\textbf{B}$ represents the Hadamard product and summation is performed over all computed entries. Obtaining $f$ for each pair of a segmented instance and a ground truth, we form the cost matrix. In this matrix, the Hungarian algorithm crosses out the higher values as absolute matches (such as $\textbf{Y}_{t}^{2}$ and $\textbf{I}_{t}^{1}$ in Figure 7) and subsequently optimizes the rest of the matching procedure. As a result, we obtain a matrix where the $(i^{th},j^{th})$ element expresses the correspondence between the $i^{th}$ segmented instance (for any $i\in\{1,...,{m}_{t}\}$ ) and the $j^{th}$ label ( $j\in\{1,...,{n}_{t}\}$ ) at time $t$ .
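The matching step can be illustrated with a small, self-contained sketch. Two caveats apply. First, the similarity used here, $\sum(\textbf{A}\circ\textbf{B})/\sum(\textbf{A}+\textbf{B}-\textbf{A}\circ\textbf{B})$, is an assumed overlap score built from the Hadamard product and is not necessarily the exact form of $f$. Second, the optimal assignment is found by exhaustive search rather than by the Hungarian algorithm; it returns the same optimum on small matrices, whereas in practice a dedicated solver such as SciPy's `linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def mask_similarity(a, b):
    """Assumed overlap score: sum(A o B) / sum(A + B - A o B) for binary masks."""
    inter = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x + y - x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def best_assignment(sim):
    """Exhaustively find the one-to-one matching with maximum total similarity.

    sim[i][j] is the similarity between prediction i and label j.
    Returns a list of (i, j) pairs. Only practical for small matrices.
    """
    m, n = len(sim), len(sim[0])
    if m <= n:
        best = max(permutations(range(n), m),
                   key=lambda p: sum(sim[i][p[i]] for i in range(m)))
        return [(i, best[i]) for i in range(m)]
    best = max(permutations(range(m), n),
               key=lambda p: sum(sim[p[j]][j] for j in range(n)))
    return sorted((best[j], j) for j in range(n))


preds = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]
labels = [[[0, 0], [1, 1]], [[1, 1], [0, 0]]]
sim = [[mask_similarity(p, l) for l in labels] for p in preds]
pairs = best_assignment(sim)  # prediction 0 pairs with label 1, and vice versa
```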

Loss function: For the loss function, we use the terms defined in Table I at time $t$ , for the $i^{th}$ segmented instance and $j^{th}$ label instance.

The total loss $L$ is defined as the sum of $L_{\text{att}}$ (loss of the attention network), $L_{\text{seg}}$ (loss of the segmentation network) and $L_{\text{count}}$ (loss of the counter function):

[TABLE]

Based on the matched results, we define $L_{\text{att}}$ as follows:

[TABLE]

where

[TABLE]

We use function $f$ as is defined in Equation 11 to compute the number of shared pixels between the $i^{th}$ proposed attention region and the $j^{th}$ true detection box as $l_{\text{att}}$ . For segmentation loss, $L_{\text{seg}}$ , we use

[TABLE]

with ${l_{\text{seg}}}^{i,j}$ formulated to weight the similarity between $i^{th}$ segmented instance and the $j^{th}$ label instance and defined as

[TABLE]

Finally, for $L_{\text{count}}$ , we employ the monotonic score loss proposed by [48], since the counter function must compare high vs. low scores to make the network select more confident objects first:

[TABLE]

During the test, our algorithm iterates over the test frame(s) and produces a score by the counter function. If this score falls under a certain threshold, the algorithm stops iterating through the same frame, resets, and moves on to the next frame along the sequence of the time-lapse images.

II-D Part 2: Data association

After the segmentation step, we associate the segmented instances for every two consecutive frames ( $t$ and $t+1$ ). For this purpose, we use a second Hungarian algorithm dedicated to association, with a cost function $f(\textbf{I}_{t}^{i},\textbf{I}_{t+1}^{j})$ representing the ( ${i}^{th}$ , ${j}^{th}$ ) element of the cost matrix (Figure 8). This function calculates the Intersection over Union (IoU) between the ${i}^{th}$ segmented instance at time ${t}$ and the ${j}^{th}$ segmented instance at ${t+1}$ . Doing so, we obtain three types of MT counts during the test:

• $m_{t,t+1}\leq\min\big{(}m_{t},m_{t+1}\big{)}$ which is the number of instances being transferred to the next frame in a one-to-one manner.

• $m_{t,ext}=m_{t}-m_{t,t+1}$ indicating the number of instances at frame ${t}$ which left the scene by either sudden disappearance or exiting the frame.

• $m_{t,ent}=m_{t}-m_{t-1,t}$ expressing the number of instances at frame ${t}$ that enter the frame by a sudden appearance or simply move into the frame.

These counts help to compute the displacement among segmented instances in successive frames.
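The three counts follow directly from the list of accepted one-to-one matches between two consecutive frames, as the short sketch below shows. The function name is hypothetical, and how a match is accepted (for example, a minimum IoU) is left outside this sketch.

```python
def association_counts(m_t, m_next, matches):
    """Count transferred, exited, and entered instances between two frames.

    m_t, m_next: number of segmented instances at frames t and t+1
    matches:     accepted one-to-one (i, j) pairs between the two frames
    """
    transferred = len(matches)            # m_{t,t+1}
    assert transferred <= min(m_t, m_next)
    exited = m_t - transferred            # m_{t,ext}
    entered_next = m_next - transferred   # m_{t+1,ent}
    return transferred, exited, entered_next


counts = association_counts(5, 4, [(0, 0), (1, 2), (3, 1)])
```

Here five instances at frame $t$ and four at frame $t+1$ with three matches give three transferred, two exited, and one newly entered instance.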

III Results

III-A Data sets

A common problem in biomedical imaging is the lack of large amounts of precisely annotated data. Herein, we have the same challenge; thus, we generate a type of data that closely simulates the actual microscopy images of MTs.

III-A1 Simulated data

Similar to the real time-lapse images, the simulated data set should have RGB channels captured from the central area of a large predefined frame to resemble the entrance and egress of MTs at the edges. Specific settings for generating a simulated sequence of frames include: size and number of the frames, spatial and temporal resolution, initial number of instances, their geometric specifications, and motion parameters. As can be seen in Figure 9, MT instances are represented by wagon-train-shaped structures with identical width. To better imitate the statistical characteristics of real MTs, such as their length, velocity, and dynamic instability, we collect the respective information from real imaging data. We take samples of real instances for which at least two experts provided unanimous labels. We then fit a multi-modal Gaussian distribution to each relevant characteristic of our samples, since it has the least mean-square fitting error. This least-squares criterion is consistent with the maximum likelihood criterion under the assumption that the underlying distribution is Gaussian. The resulting distributions are shown in Figure 10, where for each property we set the number of modes in its distribution function equal to the number of states it may adopt. For instance, we use 3 modes for dynamic instability, corresponding to the 3 phases of shrinking, growing, and pause. In compiling our simulated data, we draw the MT characteristics from the obtained distributions. Additionally, to mimic TIRF microscopy, another variable is taken into account that controls the sudden appearance/disappearance of MTs. We apply a contrast to our simulated image intensities such that MTs have brighter interiors and are brighter still in their overlapping areas (see Figure 9). After generating the simulated data, we discard the first few frames to avoid any bias induced by the initial conditions. All these details afford greater fidelity to real-world microscopy images.
In the end, we generate 40 simulated image sequences, each with 379 frames of size $256\times 256$ pixels. We use 4 randomly chosen sequences as the test data, while the remainder is used for training in a 5-fold cross-validation setting.

III-A2 Real data

The real data is gathered in-house and contains 23 RGB time-lapse sequences, each having a duration of $23.16\pm 12.68$ seconds. Every sequence is sampled at a rate of 16 frames per second with a spatial dimension of $256\times 256$ pixels. We use 19 sequences of this data set for training, repeated under a cross-validation setting, and use the other 4 to test our algorithm. This data set is annotated by three experts, who were asked to click on five points along the length, from head to tail, of what they interpret as a single MT. Five was the smallest number found empirically to be sufficient for extracting individual MTs in complex overlapping scenarios. The experts went through the whole time-lapse sequence to label MTs. Using these 5-point labels, we extract MT bodies with further processing based on thresholding, region growing, and template matching algorithms [10]. Since there is some variation among the obtained labels, in each case we use the most-voted areas over all three labels as our ground truth. The resulting annotations are used to train our network. We also directly use the coordinates of the head of each MT in consecutive frames to obtain a thorough description ( ${\bm{\delta}}_{GT}$ ) of the ground truth displacement. These vector labels are used in our final evaluation.

III-B Evaluation Metrics

To evaluate the segmentation performance of our proposed method in segmenting the $k^{th}$ instance at the $t^{th}$ frame, we employ the conventional Jaccard index ( $J$ ) as defined in Equation 18:

[TABLE]
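For reference, the conventional Jaccard index of two binary masks is the size of their intersection divided by the size of their union. A minimal sketch, with masks given as lists of rows and with the convention (an assumption of this sketch) that two empty masks score 1.0:

```python
def jaccard_index(pred, truth):
    """|pred AND truth| / |pred OR truth| for two binary masks (lists of rows)."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
j = jaccard_index(pred, truth)  # 2 shared pixels out of 4 covered pixels
```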

We report the best value for $J$ as well as the average $\text{J}^{k}_{t}$ value obtained over all instances in Table II. Since the ultimate goal in this study is to estimate MT motions along sequential frames, we quantify the overall performance of our proposed method in terms of displacement estimation. It should be noted that displacement is measured with respect to relocation of MT leading ends or heads. In the obtained trajectories, we subtract every two assigned instances at consecutive frames ( $\textbf{I}_{t+1}-\textbf{I}_{t}$ ) to have the area presenting the head of the MT. We use the center of this area and present it using $(x,y)$ . We define displacement ${\bm{\delta}}$ as in Eq. 19:

[TABLE]
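The head-localization idea described above can be sketched as follows. The head region is taken as the pixels covered at $t+1$ but not at $t$, and its centroid gives $(x,y)$. The displacement is then taken as the difference between two successive head positions, which is an assumed reading of ${\bm{\delta}}$; the function names are hypothetical.

```python
def head_center(mask_t, mask_next):
    """Centroid (x, y) of pixels present at t+1 but absent at t."""
    xs, ys = [], []
    for y, (row_t, row_n) in enumerate(zip(mask_t, mask_next)):
        for x, (a, b) in enumerate(zip(row_t, row_n)):
            if b and not a:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # no newly covered pixels, e.g. a paused MT
    return sum(xs) / len(xs), sum(ys) / len(ys)


def displacement(head_a, head_b):
    """Vector from one head position to the next (assumed form of delta)."""
    return head_b[0] - head_a[0], head_b[1] - head_a[1]


# A toy MT gliding to the right along a single row.
m0 = [[1, 1, 0, 0, 0, 0]]
m1 = [[0, 1, 1, 0, 0, 0]]
m2 = [[0, 0, 1, 1, 1, 0]]
h1 = head_center(m0, m1)
h2 = head_center(m1, m2)
delta = displacement(h1, h2)
```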

To evaluate the similarity between two displacement vectors: ${\bm{\delta}}$ and the ground truth ( ${\bm{\delta}}_{GT}$ ) in terms of their orientations or magnitudes, we introduce a novel measure Vsim in Eq. 20:

[TABLE]

We obtain the best Vsim value (i.e., BVs) for the $i^{th}$ displacement vector obtained at the transition from the $t^{th}$ to the ${t+1}^{th}$ frame, and report BVs and the average $\text{BVs}_{\,t}^{\,i}$ over all displacement vectors:

[TABLE]

After assigning each displacement to its equivalent ground truth, we count the true positives. We define a true positive according to the variation among the three experts' labeling outcomes: we allow our network to make errors smaller than the difference between the labels obtained from the experts. The false discovery rate (FDR) is measured as the ratio of the false positives to the total number of computed displacements at each frame, where false positives are the vectors that were not assigned to any ground truth or that, if assigned, did not fulfill the requirements of being a true positive. We also define the false negative rate (FNR) as the ratio of the ground truth vectors with no attributed estimated vector to the total number of ground truth vectors at each frame. Finally, Difference in Counting (DiC) is used to compare the counted number of segmented instances against the ground truths:

[TABLE]

where sub-indexes $trans$ , $ext$ , and $ent$ respectively denote the MTs which transitioned, exited, and entered the $t^{th}$ frame.
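The FDR and FNR definitions given above reduce to two ratios per frame. The sketch below assumes the four counts have already been obtained from the assignment step; the function and argument names are hypothetical, and returning 0.0 for an empty denominator is a choice made for this sketch.

```python
def detection_rates(n_estimated, n_ground_truth, n_true_positive, n_gt_matched):
    """Per-frame false discovery rate and false negative rate.

    n_estimated:     number of computed displacement vectors
    n_ground_truth:  number of ground-truth displacement vectors
    n_true_positive: estimated vectors that were assigned and met the TP criteria
    n_gt_matched:    ground-truth vectors with an attributed estimated vector
    """
    fdr = (n_estimated - n_true_positive) / n_estimated if n_estimated else 0.0
    fnr = (n_ground_truth - n_gt_matched) / n_ground_truth if n_ground_truth else 0.0
    return fdr, fnr


fdr, fnr = detection_rates(10, 8, 7, 6)
```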

III-C Qualitative evaluations

Some instance-level MT segmentation results are presented in Figure 11. As shown, there is a significant positive correlation between the ground truth and the results of our best model for both the simulated and the real data. Visual results of MT tracking are provided in the Appendix (Figure 13), where the displacements of the MT heads are shown together with their velocity amplitudes.

III-D Quantitative evaluations

We analyze the performance of our algorithm in more detail by providing five tables. All values are obtained using a threshold value of 0.23 for the counter function and are averaged over all frames and all instances (if applicable) in the test data. This threshold value minimizes the average $FN\times FP$ (in terms of segmentation results) for a set of 100 randomly chosen frames from the training set. With this configuration, our optimum design runs at 250 ms per frame ( $256\times 256$ ) to perform segmentation.
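The threshold selection rule (pick the value that minimizes the average $FN\times FP$ over a set of frames) can be sketched as a simple search over candidate thresholds. The `evaluate` callable and the toy numbers below are hypothetical and only illustrate the selection logic, not the reported results.

```python
def pick_threshold(candidates, evaluate):
    """Return the candidate threshold with the smallest mean FN x FP.

    evaluate: hypothetical callable threshold -> list of (fn, fp) pairs,
              one pair per validation frame
    """
    def mean_product(th):
        pairs = evaluate(th)
        return sum(fn * fp for fn, fp in pairs) / len(pairs)

    return min(candidates, key=mean_product)


# Toy per-frame (FN, FP) counts for three candidate thresholds.
toy_results = {
    0.1: [(1, 9), (2, 8)],
    0.2: [(2, 2), (3, 1)],
    0.3: [(6, 1), (7, 1)],
}
best = pick_threshold(sorted(toy_results), toy_results.get)
```

Minimizing the product rather than the sum penalizes thresholds that trade a large increase in one error type for a small decrease in the other.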

Tables II and III illustrate the advantage of our method (optimum design) in terms of segmentation and velocity estimation over two baseline methods: adaptive template matching and the piece-wise stationary multiple motion Kalman smoother (PMM Kalman smoother). Adaptive template matching updates an initial set of templates with results obtained from the 3 past frames [10]. The PMM Kalman smoother uses a piece-wise stationary multiple motion model. As shown quantitatively in both tables, our framework outperforms the baseline results. A reduction of at least 0.235 in FDR and 0.104 in FNR in velocity estimation confirms the greater capability of our method in dealing with the complicated problem of instance-level MT segmentation and tracking. The results show a reduction of at least 0.066 in FNR, along with an acceptable FPR. This value of the FPR does not degrade the performance of our algorithm in velocity estimation. Additionally, we have compared our algorithm to two instance-level segmentation methods ([26] and [27]). As confirmed by Table VII, our algorithm results in a significant reduction of the FNR.

Table VI demonstrates the potential of using an RNN component compared to a CNN component as the visual attention operator. Tables IV and V summarize the contrast between the two possibilities of using original frames or their respective OFs as part of the input to visual attention. Results show that using OF can increase the precision up to 0.29 in the case of real data.

Ablation study: To assess our proposed visual attention module, we substitute the CNN+RNN with a CNN in Exp-1. When using the CNN only, the descriptive characteristics of the bounding box are produced by the last fully connected layer. This experiment compares our proposed design against the methods in [45, 44, 46]; the results are presented in Table VI. As shown, the spatial reasoning of the LSTM gives it an advantage over using CNNs alone, although the results obtained in both cases are of a comparable order. In another experiment (Exp-2), we demonstrate how the injection of temporal information into our instance-level segmentation network improves the quality of displacement estimation. To this end, we supply the network with various numbers of frames. The results, reported in Tables IV and V, support the idea that using neighboring frames leads to better estimation and fewer missed detections (false negatives).

In Exp-3, we examine whether temporal information affords better estimation in the form of raw neighboring frames or their respective OF. Comparing Tables IV and V reveals the results of this experiment, indicating a dramatic improvement in the case of OF. This is because using OF adds an initial motion cue to the information at the current frame, which leads to a more accurate detection path.

Finally, Exp-4 is designed to study the network performance on simulated vs. real data. Results from Tables IV and V indicate that real data is harder to analyze. For both data categories, we get improved results from our trained model compared to the baseline methods. Unlike our comprehensive labels for the simulated data (which include the exact pixels of every MT), we had only 5-marker coordinates directly annotated by experts due to limited time and expertise. While the questionable nature of these labels can critically affect the network’s performance in segmentation, it did not prevent a significant contribution to displacement estimation.

Exp-5 and Exp-6 are aimed at studying the robustness of our algorithm. In Exp-5, we quantify the algorithm performance in terms of instance-level MT segmentation for different levels of crowdedness in each frame. As demonstrated by Table IX, overpopulated scenes degrade the segmentation results in terms of significantly higher FPR and FNR. However, this failure rate manifests itself specifically when the algorithm faces a crowd of more than 30 MTs within a frame. In Exp-6, the quality of velocity estimation is evaluated against the sampling rate of a time-lapse sequence. According to Table IX, while (time-wise) downsampling visibly deteriorates the performance, sampling rates of 2 and 4 lead to unacceptable results.

IV Discussions and Concluding Remarks

Several methods have been proposed to facilitate automated tracking of the growing ends of MTs. However, there is a paucity of automated approaches available to extract and measure the velocities of MTs in in-vitro gliding assays. The major hurdle is the frequent MT-MT interactions, which cause abrupt changes in MT motion trajectories. The nature of MT motion in these assays renders manual inspection and simple modeling tools inadequate for velocity characterization and measurement. Both the human eye and simple methods tend to be biased by local changes in population density. The ever-changing patterns of MT motion make this characterization even more challenging. In summary, these limitations necessitate an automated approach with higher accuracy to accelerate the process of segmentation, tracking, and analysis.

In this study, we employed new algorithms resulting in fewer false positives and false negatives, while accounting for motion complexity. Our proposed approach iterates through attention and segmentation blocks to recognize one instance at a time. The presented attention network mimics human vision to set attention boundaries. Once the attention network finds candidate regions of MTs, a back-to-back encoder-decoder engine is used to segment the relevant instances inside the candidate regions.

Despite achieving state-of-the-art performance in MT segmentation and velocity estimation using our designed network, there are still areas for potential improvement in the near future. Expanding the repertoire of real data sets, in the form of an extended library of time-lapse image series, is one of them. For simulated data sets, increasing the data size with realistic augmentation strategies may improve the training quality. In this regard, conditional generative adversarial networks are potential methods to apply [52]. Previously, authors adopted “attention” to get fine-grained details of a single object in an image. This conventional attention concept is an artificial version of human vision in “looking at a particular scene while giving deep attention into details of a small compartment in the same scene”. However, in our work, we adopted “attention” to exploit the spatial relation that exists among different instances of the MTs in a single frame. Hence, this version of attention can be interpreted as an eye movement that switches attention from one instance to another, while they are not directly related. Our algorithm concentrates on different regions of a single frame and segments an instance in each region. Once all the results from individual frames are available, our algorithm associates the results to extract trajectories. We believe our algorithm has the least cost to perform instance-level MT segmentation and tracking among all other available methods. For instance, while the attention-based work by [53] can be a great fit for 1-D machine translation applications, its direct extension to our 2-D+t problem is costly. The intrinsic elements of this method, such as query, keys, and values, cannot simply be mapped with a dot product. This structure can be extended into a version that could simultaneously incorporate both temporal and spatial dimensions of this problem.
Such a structure would bypass the requirement for a subsequent data association algorithm and generate results with additional spatial accuracy and temporal smoothness. As this research progresses, it is hoped that it will provide an automatic segmentation platform to help researchers better study the molecular basis for the motor-dependent spatial organization of MTs in both interphase and mitotic cells.

Qualitative results for tracking. Figures 12 and 13 respectively depict the input frames and the output frames which belong to a sample time-lapse sequence. Output frames represent the value of the velocity amplitude along the displacement for each individual MT.

Definitions of quantitative measures. In evaluating the segmentation, we define true positive as a count of the pixels that concurrently occur in both the segmented result and the ground truth. False positive, false negative and true negative are defined accordingly.

We also define true positive in case of data association when the following conditions are fulfilled with respect to the obtained displacement and the ground truth vectors:

• The center of the estimated displacement is less than 7 pixels away (Euclidean distance) from its corresponding ground truth value.

• Both vectors have less than 30 degrees angular difference.

• Both vectors have less than 10% difference in magnitude.
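The three conditions above can be checked together as in the sketch below. Some details are interpretations made for this sketch rather than statements from the paper: the "center" of a displacement is taken as the midpoint of the vector, the magnitude difference is measured relative to the ground-truth magnitude, and zero-length vectors are rejected because their angle is undefined.

```python
import math


def is_true_positive(est_start, est_end, gt_start, gt_end,
                     max_dist=7.0, max_angle_deg=30.0, max_rel_mag=0.10):
    """Check the distance, angle, and magnitude criteria for one vector pair."""
    ex, ey = est_end[0] - est_start[0], est_end[1] - est_start[1]
    gx, gy = gt_end[0] - gt_start[0], gt_end[1] - gt_start[1]
    e_mag, g_mag = math.hypot(ex, ey), math.hypot(gx, gy)
    if e_mag == 0.0 or g_mag == 0.0:
        return False  # angle undefined for zero-length vectors (sketch choice)

    # 1) midpoints of the two vectors are less than max_dist pixels apart
    ec = ((est_start[0] + est_end[0]) / 2.0, (est_start[1] + est_end[1]) / 2.0)
    gc = ((gt_start[0] + gt_end[0]) / 2.0, (gt_start[1] + gt_end[1]) / 2.0)
    close = math.hypot(ec[0] - gc[0], ec[1] - gc[1]) < max_dist

    # 2) angular difference below max_angle_deg
    cos_a = max(-1.0, min(1.0, (ex * gx + ey * gy) / (e_mag * g_mag)))
    aligned = math.degrees(math.acos(cos_a)) < max_angle_deg

    # 3) magnitude differs by less than max_rel_mag of the ground truth
    similar = abs(e_mag - g_mag) / g_mag < max_rel_mag

    return close and aligned and similar


ok = is_true_positive((0, 0), (10, 0), (1, 1), (11, 1.5))
```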

Acknowledgement

We thank our colleague Paul Mooney (University of Wyoming) for providing the imaging data. We are also immensely grateful to John S. Oakey for his insight and expertise that greatly assisted the research. We thank Badrun Nessa Rahman and Yashasvi Bhat at the University of Central Florida for assistance with data labeling.

Bibliography

[1] K. T. Applegate, Quantitative image analysis algorithms for the measurement of cytoskeleton dynamics. PhD thesis, The Scripps Research Institute, 2010.
[2] T. Kreis and R. E. Vale, Guidebook to the Cytoskeletal and motor proteins, vol. 2. Oxford University Press, 1999.
[3] E. Harrison, “Medical News Today.” http://www.prezi.com/xz 88yonibenc/the-malfunction-of-the-microtubules , 2015. Online; accessed 08 May 2017.
[4] T. Mitchison and M. Kirschner, “Dynamic instability of microtubule growth,” Nature, vol. 312, no. 5991, p. 237, 1984.
[5] F. Pampaloni and E.-L. Florin, “Microtubule architecture: inspiration for novel carbon nanotube-based biomimetic materials,” Trends in Biotechnology, vol. 26, no. 6, pp. 302–310, 2008.
[6] J. Mcintosh and S. Cleland, “Anaphase sliding of spindle microtubules,” Journal of Cell Biology, vol. 43, no. 2 P 2, p. A 89, 1969.
[7] C. E. Walczak and S. L. Shaw, “A map for bundling microtubules,” Cell, vol. 142, no. 3, pp. 364–367, 2010.
[8] A. Desai, S. Verma, T. J. Mitchison, and C. E. Walczak, “Kin I kinesins are microtubule-destabilizing enzymes,” Cell, vol. 96, no. 1, pp. 69–78, 1999.