Long-term Correlation Tracking using Multi-layer Hybrid Features in Sparse and Dense Environments

Nathanael L. Baisa; Deepayan Bhowmik; Andrew Wallace

arXiv:1705.11175·cs.CV·February 5, 2019

TL;DR

This paper introduces a long-term visual tracking algorithm that combines multi-layer hybrid features, an online classifier, and a re-detection module to effectively track targets in both sparse and crowded environments, outperforming existing methods.

Contribution

The paper presents a novel long-term tracking approach integrating multi-layer CNN features, traditional features, and a re-detection mechanism with probabilistic filtering, advancing robustness in complex scenarios.

Findings

1. Significantly outperforms state-of-the-art tracking methods on various datasets.

2. Effectively handles occlusions and appearance changes through re-detection.

3. Achieves high accuracy in both sparse and dense environments.

Tables

Table I: Implementation parameters.

Parameters	$λ$	$σ$	$η$	C	$T_{r d}$	$T_{t d}$	$δ_{p}$	$δ_{n}$	$S$	$a$	$λ_{t}$	$U$	$T$
Values	$10^{- 4}$	0.1	0.01	2	0.15	0.40	0.9	0.3	31	1.04	4	4 pixels	$10^{- 5}$

Equations

y^{(l)}(m,n) = \exp\left(-\frac{(m - M/2)^{2} + (n - N/2)^{2}}{2\sigma^{2}}\right),

\min_{\mathbf{w}^{(l)}} \sum_{m,n} \left|\Phi(\mathbf{x}^{(l)}).\mathbf{w}^{(l)} - y^{(l)}(m,n)\right|^{2} + \lambda\left|\mathbf{w}^{(l)}\right|^{2},

\mathbf{w}^{(l)} = \sum_{m,n} a^{(l)}(m,n)\,\Phi(\mathbf{x}^{(l)}_{m,n}),

\mathbf{A}^{(l)} = \mathcal{F}(\mathbf{a}^{(l)}) = \frac{\mathcal{F}(\mathbf{y}^{(l)})}{\mathcal{F}\big(\Phi(\mathbf{x}^{(l)}).\Phi(\mathbf{x}^{(l)})\big) + \lambda},

\mathbf{r}^{(l)} = \mathcal{F}^{-1}\big(\tilde{\mathbf{A}}^{(l)} \odot \mathcal{F}(\Phi(\mathbf{z}^{(l)}).\Phi(\tilde{\mathbf{x}}^{(l)}))\big),

\mathbf{r}(m,n) = \sum_{l} \gamma(l)\,\mathbf{r}^{(l)}(m,n),

(\hat{m}, \hat{n}) = \underset{m,n}{\operatorname{argmax}}\;\mathbf{r}(m,n),

\tilde{X}_{k}^{(l)}

\tilde{A}_{k}^{(l)}

K(\mathbf{x}^{(l)}_{i},\mathbf{x}^{(l)}_{j}) = (\mathbf{x}^{(l)}_{i})^{T}\mathbf{x}^{(l)}_{j} = \mathcal{F}^{-1}\Big(\sum_{d}(\mathbf{X}^{(l)}_{i,d})^{*}\odot\mathbf{X}^{(l)}_{j,d}\Big),

K(\mathbf{x}^{(l)}_{i},\mathbf{x}^{(l)}_{j}) = \Phi(\mathbf{x}^{(l)}_{i})^{T}\Phi(\mathbf{x}^{(l)}_{j}) = \exp\Big(-\frac{|\mathbf{x}^{(l)}_{i}-\mathbf{x}^{(l)}_{j}|^{2}}{\sigma^{2}}\Big) = \exp\Big(-\frac{1}{\sigma^{2}}\Big(\|\mathbf{x}^{(l)}_{i}\|^{2}+\|\mathbf{x}^{(l)}_{j}\|^{2}-2\,\mathcal{F}^{-1}\Big(\sum_{d}(\mathbf{X}^{(l)}_{i,d})^{*}\odot\mathbf{X}^{(l)}_{j,d}\Big)\Big)\Big),

\acute{y}_{i} = \begin{cases} +1, & \text{if } IOU(\acute{\mathbf{x}}_{i}, \ddot{\mathbf{x}}_{t}) \geq \delta_{p} \\ -1, & \text{if } IOU(\acute{\mathbf{x}}_{i}, \ddot{\mathbf{x}}_{t}) < \delta_{n} \end{cases}

\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{N}\xi_{i}^{p}

y_{i}(\mathbf{w}.\Phi(\mathbf{x}_{i}) + b) \geq 1 - \xi_{i}, \quad \xi_{i} \geq 0 \quad \forall i \in \{1, ..., N\}.

\min_{0 \leq a_{i} \leq C} W = \frac{1}{2}\sum_{i,j=1}^{N} a_{i}Q_{ij}a_{j} - \sum_{i=1}^{N} a_{i} + b\sum_{i=1}^{N} y_{i}a_{i},

\frac{\partial W}{\partial a_{i}} = \sum_{j=1}^{N} Q_{ij}a_{j} + y_{i}b - 1 \begin{cases} > 0, & \text{if } a_{i} = 0 \\ = 0, & \text{if } 0 \leq a_{i} \leq C \\ < 0, & \text{if } a_{i} = C, \end{cases}

\frac{\partial W}{\partial b} = \sum_{j=1}^{N} y_{j}a_{j} = 0,

\mathbf{w} = \sum_{i \in S_{1} \cup S_{2}} a_{i}y_{i}\Phi(\mathbf{x}_{i}),

y_{k|k-1}(x|\zeta) = \mathcal{N}(x; F_{k-1}\zeta, Q_{k-1})

f_{k}(z|x) = \mathcal{N}(z; H_{k}x, R_{k})

\gamma_{k}(x) = \sum_{v=1}^{V_{\gamma,k}} w_{\gamma,k}^{(v)}\,\mathcal{N}(x; m_{\gamma,k}^{(v)}, P_{\gamma,k}^{(v)})

\mathcal{D}_{k-1}(x) = \sum_{v=1}^{V_{k-1}} w_{k-1}^{(v)}\,\mathcal{N}(x; m_{k-1}^{(v)}, P_{k-1}^{(v)}),

\mathcal{D}_{k|k-1}(x) = \mathcal{D}_{S,k|k-1}(x) + \gamma_{k}(x),

\mathcal{D}_{S,k|k-1}(x) = p_{s,k}\sum_{v=1}^{V_{k-1}} w_{k-1}^{(v)}\,\mathcal{N}(x; m_{S,k|k-1}^{(v)}, P_{S,k|k-1}^{(v)}),

m_{S,k|k-1}^{(v)} = F_{k-1}m_{k-1}^{(v)},

P_{S,k|k-1}^{(v)} = Q_{k-1} + F_{k-1}P_{k-1}^{(v)}F_{k-1}^{T},

\mathcal{D}_{k|k-1}(x) = \sum_{v=1}^{V_{k|k-1}} w_{k|k-1}^{(v)}\,\mathcal{N}(x; m_{k|k-1}^{(v)}, P_{k|k-1}^{(v)}),

\mathcal{D}_{k|k}(x) = (1 - p_{D,k})\,\mathcal{D}_{k|k-1}(x) + \sum_{z \in Z_{k}} \mathcal{D}_{D,k}(x; z),

\mathcal{D}_{D,k}(x; z) = \sum_{v=1}^{V_{k|k-1}} w_{k}^{(v)}(z)\,\mathcal{N}(x; m_{k|k}^{(v)}(z), P_{k|k}^{(v)}),

w_{k}^{(v)}(z) = \frac{p_{D,k}\,w_{k|k-1}^{(v)}\,q_{k}^{(v)}(z)}{c_{s_{k}}(z) + p_{D,k}\sum_{l=1}^{V_{k|k-1}} w_{k|k-1}^{(l)}\,q_{k}^{(l)}(z)},

q_{k}^{(v)}(z) = \mathcal{N}(z; H_{k}m_{k|k-1}^{(v)}, R_{k} + H_{k}P_{k|k-1}^{(v)}H_{k}^{T}),
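The GM-PHD measurement update listed above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the clutter intensity is passed in as a constant, and the per-component mean and covariance updates use the standard Kalman step, which is assumed here because it is not spelled out in the equations above.

```python
import numpy as np

def gaussian_pdf(z, mean, cov):
    """Density of N(z; mean, cov) for a 1-D vector z."""
    d = z - mean
    norm = np.sqrt((2 * np.pi) ** len(z) * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm)

def gmphd_update(weights, means, covs, Z, H, R, p_d, clutter):
    """One GM-PHD update step (hypothetical helper).

    weights/means/covs describe the predicted mixture D_{k|k-1};
    Z is the list of measurements (detection proposals).
    """
    # Missed-detection terms: (1 - p_D) * D_{k|k-1}
    out_w = [(1 - p_d) * w for w in weights]
    out_m, out_P = list(means), list(covs)
    for z in Z:
        w_z, m_z, P_z = [], [], []
        for w, m, P in zip(weights, means, covs):
            S = R + H @ P @ H.T                    # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)         # Kalman gain (assumed standard form)
            w_z.append(p_d * w * gaussian_pdf(z, H @ m, S))
            m_z.append(m + K @ (z - H @ m))
            P_z.append((np.eye(len(m)) - K @ H) @ P)
        denom = clutter + sum(w_z)                 # clutter intensity plus total likelihood
        out_w += [w / denom for w in w_z]
        out_m += m_z
        out_P += P_z
    return out_w, out_m, out_P

# Usage: one predicted component, two proposals; the one near the prediction wins.
w, m, P = gmphd_update(
    [1.0], [np.zeros(2)], [np.eye(2)],
    Z=[np.array([0.0, 0.0]), np.array([5.0, 5.0])],
    H=np.eye(2), R=np.eye(2), p_d=0.9, clutter=1e-5,
)
best = int(np.argmax(w))
```

Taking the component with the largest updated weight as the state estimate corresponds to keeping one detection proposal and treating the rest as clutter.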

Taxonomy

Methods: Support Vector Machine

Full text

Nathanael L. Baisa, Deepayan Bhowmik, and Andrew Wallace

N. L. Baisa and A. Wallace are with the Department of Electrical, Electronic and Computer Engineering, Heriot Watt University, Edinburgh EH14 4AS, United Kingdom (e-mail: {nb30, a.m.wallace}@hw.ac.uk). D. Bhowmik is with the Department of Computing, Sheffield Hallam University, Sheffield S1 1WB, United Kingdom (e-mail: [email protected]).

Abstract

Tracking a target of interest in both sparse and crowded environments is a challenging problem, not yet successfully addressed in the literature. In this paper, we propose a new long-term visual tracking algorithm, learning discriminative correlation filters and using an online classifier, to track a target of interest in both sparse and crowded video sequences. First, we learn a translation correlation filter using a multi-layer hybrid of convolutional neural networks (CNN) and traditional hand-crafted features. We combine advantages of both the lower convolutional layer which retains more spatial details for precise localization and the higher convolutional layer which encodes semantic information for handling appearance variations, and then integrate these with histogram of oriented gradients (HOG) and color-naming traditional features. Second, we include a re-detection module for overcoming tracking failures due to long-term occlusions by training an incremental (online) SVM on the most confident frames using hand-engineered features. This re-detection module is activated only when the correlation response of the object is below some pre-defined threshold. This generates high score detection proposals which are temporally filtered using a Gaussian mixture probability hypothesis density (GM-PHD) filter to find the detection proposal with the maximum weight as the target state estimate by removing the other detection proposals as clutter. Finally, we learn a scale correlation filter for estimating the scale of a target by constructing a target pyramid around the estimated or re-detected position using the HOG features. We carry out extensive experiments on both sparse and dense data sets which show that our method significantly outperforms state-of-the-art methods.

Index Terms:

Visual tracking, Correlation filter, CNN features, Hybrid features, Online learning, GM-PHD filter

I Introduction

Visual target tracking is one of the most important and active research areas in computer vision with a wide range of applications like surveillance, robotics and human-computer interaction (HCI). Although it has been studied extensively during past decades as recently surveyed in [1][2], object tracking is still a difficult problem due to many challenges that cause significant appearance changes of targets such as varying illumination, occlusion, pose variations, deformation, abrupt motion, background clutter, and high target densities (in crowded environments). Robust representation of target appearance is important to overcome these challenges.

Recently, convolutional neural network (CNN) features have demonstrated outstanding results on various recognition tasks [3, 4]. Motivated by this, a few deep learning based trackers [5, 6] have been developed. In addition, discriminative correlation filter-based trackers have achieved state-of-the-art results, as surveyed in [7], in terms of both efficiency and robustness for three reasons. First, efficient correlation operations are performed by replacing exhaustive circular convolutions with element-wise multiplications in the frequency domain, which can be computed at very high speed using the fast Fourier transform (FFT). Second, thousands of negative samples around the target’s environment can be efficiently incorporated through circular-shifting with the help of a circulant matrix. Third, training samples are regressed to soft labels of a Gaussian function (Gaussian-weighted labels) instead of binary labels, alleviating sampling ambiguity. In fact, regression with class labels can be seen as classification. However, correlation filter-based trackers are susceptible to long-term occlusions.

In addition, the Gaussian mixture probability hypothesis density (GM-PHD) filter [8] has an in-built capability of removing clutter while filtering targets with very efficient speed without the need for explicit data association. Though this filter is designed for multi-target filtering, it is even preferable for single target filtering in scenes with challenging background clutter as well as clutter that comes from other targets not of current concern. This filtering approach is flexible, for instance, it has been extended for multiple targets of different types in [9][10].

In this work, we mainly focus on long-term tracking of a target of interest in sparse as well as crowded environments where the unknown target is initialized by a bounding box and then tracked in subsequent frames. Without making any constraint on the video scene, we develop a novel long-term online tracking algorithm that can close the research gap between sparse and crowded scenes tracking problems using the advantages of the correlation filters, a hybrid of multi-layer CNN and hand-crafted features, an incremental (online) support vector machine (SVM) classifier and a Gaussian mixture probability hypothesis density (GM-PHD) filter. To the best of our knowledge, nobody has adopted this approach.

The main contributions of this paper are as follows:

1. We integrate a hybrid of multi-layer CNN and traditional hand-crafted features for learning a translation correlation filter for estimating the target position in the next frame by extending a ridge regression for multi-layer features.

2. We include a re-detection module to re-initialize the tracker in case of tracking failures due to long-term occlusions by learning an incremental SVM from the most confident frames using hand-crafted features to generate high score detection proposals.

3. We incorporate a GM-PHD filter to temporally filter detection proposals generated from the learned online SVM to find the detection proposal with the maximum weight as the target position estimate by removing the other detection proposals as clutter.

4. We learn a scale correlation filter by constructing a target pyramid at the estimated or re-detected position using HOG features to estimate the scale of the detected target.

We presented a preliminary idea of this work in [11]. In this work, we make more elaborate descriptions of our algorithm. Besides, we include a scale estimation at the estimated target position as well as an extended experiment on a large-scale online object tracking benchmark (OOTB) in addition to the PETS 2009 data sets.

The rest of this paper is organized as follows. In section II, related work is discussed. An overview of our algorithm and the proposed algorithm in detail are described in sections III and IV, respectively. In section V, the implementation details and parameter settings are briefly discussed. The experimental results are analyzed and compared in section VI. The main conclusions and suggestions for future work are summarized in section VII.

II Related Work

Various visual tracking algorithms have been proposed over the past decades to cope with tracking challenges, and they can be categorized into two types depending on the learning strategy: generative and discriminative methods. Generative methods describe the target appearance using generative models and search for the target regions that best fit the models, i.e. search for the best-matching windows (patches). Various generative target appearance modelling algorithms have been proposed, such as online density estimation [12], sparse representation [13, 14], and incremental subspace learning [15]. On the other hand, discriminative methods build a model that distinguishes the target from the background. These algorithms typically learn classifiers based on online boosting [16], multiple instance learning [17], P-N learning [18], transfer learning [19], structured output SVMs [20] and combining multiple classifiers with different learning rates [21]. Background information is important for effective tracking, as explored in [22][23], which means that the more competitive approaches are discriminative methods [24], though hybrid generative and discriminative models can also be used [25][26]. However, sampling ambiguity is one of the big problems in discriminative tracking methods and results in drifting. Recently, correlation filters [27, 28, 29] have been introduced for online target tracking that can alleviate this sampling ambiguity. Previously, the large amount of training data required to train correlation filters prevented their application to online visual tracking, even though correlation filters are effective for localization tasks. Recently, however, all the circular-shifted versions of the input features have been considered with the help of a circulant matrix, producing a large number of training samples [27, 28].

Correlation methods have many strengths, such as inherent parallelism, shift (translation) invariance, noise robustness, and high discrimination ability [30]. Both digital and optical correlators are discussed in detail in [31], though more emphasis is given to optical correlators. Performance optimization of correlation filters by pre-processing the input target image was introduced in [32]. Recent research trends in correlation filters for various applications, with more emphasis on face recognition (and object tracking), are given in [30]. Due to their effectiveness, correlation methods have been successfully applied to many domains such as swimmer tracking [33], pose invariant face recognition [34], road sign identification for advanced driver assistance [35], etc. Some types of correlation filters are sensitive to challenges such as rotation, illumination changes, occlusion, etc. For instance, the Phase-Only Filter (POF) is sensitive to changes in rotation, illumination, occlusion, scale and noise contained in targets of interest [32], though it can give very narrow correlation peaks (good localization); a pre-processing step was used to make it invariant to illumination in [34]. Recent correlation filters such as KCF [28] are more suitable for online tracking because they generate a large number of training samples from input features using a circulant matrix, and they are more robust to tracking challenges such as rotation, illumination changes, partial occlusion, deformation, fast motion, etc. (as shown in the results section of [28]) than their earlier counterparts [30]. Using CNN features has further improved online tracking results against these challenges, as shown in [36]; however, long-term occlusion is still a problem in correlation filter-based tracking approaches.

There are three tracking scenarios that are important to consider: short-term tracking, long-term tracking, and tracking in a crowded scene. If objects are visible over the whole course of the sequences, short-term model-free tracking algorithms are sufficient to track a single object without applying a pre-trained model of target appearance. There are many short-term tracking algorithms developed in the literature [1][7] such as online density estimation [12], context-learning [37], scale estimation [29], and using features from multiple CNN layers [36, 38]. However, these short-term tracking algorithms cannot re-initialize the trackers once they fail due to long-term occlusions and confusion caused by background clutter.

Long-term tracking algorithms are important in video streams that are of indefinite length and have long-term occlusions. A Tracking-Learning-Detection (TLD) algorithm has been developed in [18] which explicitly decomposes the long-term tracking task into tracking, learning and detection. In this case, the tracker tracks the target from frame to frame and provides training data for the detector which re-initializes the tracker when it fails. The learning component estimates the detector’s errors and then updates it for correction in the future. This algorithm works well in very sparse video (video sequences with few targets) but is sensitive to background clutter. Long-term correlation tracking (LCT), developed in [39], learns three different discriminative correlation filters: translation, appearance and scale correlation filters using hand-crafted features. Even though it includes a re-detection module by learning the random ferns classifier online for re-initializing a tracker in case of tracking failures, it is not robust to long-term occlusions and background clutter. Multi-domain network (MDNet) [40] pre-trains a CNN network composed of shared layers and multiple domain-specific layers using a large set of videos to get generic target representations in the shared layers. This proposed network has separate branches of domain-specific layers for binary classification to identify the target in each domain. However, when applied to fundamentally different videos other than the related videos on which it was trained, it gives poorer results.

Tracking of a target in a crowded scene is very challenging due to long-term occlusion, many targets with appearance variation and high clutter. Person detection and tracking in crowds is formulated as a joint energy minimization problem by combining crowd density estimation and localization of individual persons in [41]. Although this approach does not require manual initialization, it has low performance for tracking a generic target of interest as it was mainly developed for tracking human heads. The method developed in [42] trained Hidden Markov Models (HMMs) on motion patterns within a scene to capture spatial and temporal variations of motion in the crowd, which are used for tracking individuals. However, this approach is limited to crowds with structured patterns, i.e. it needs some prior knowledge about the scene. The algorithm developed in [43] used visual information (prominence) and spatial context (influence from neighbours) to develop online tracking in crowded scenes without using any prior knowledge about the scene, unlike the method in [42] which uses some training data from the past as well as the future. This algorithm performs well in highly crowded scenes but has low performance in less crowded scenes as the influence from neighbours (spatial context) decreases.

In conclusion, although there are many effective algorithms that handle appearance variation, occlusion and high clutter in short and long-term video sequences, no single approach is wholly effective in all scenarios. Our proposed tracking algorithm tracks a target of interest in both sparse and dense environments without using any constraint from the video scene using correlation filters, sophisticated features and re-detection scheme particularly robust to sparse as well as highly occluded and cluttered dense scenes.

III Overview of Our Algorithm

We develop a novel long-term online tracking algorithm that can be applied to both sparse and dense environments by learning correlation filters using a hybrid of multi-layer CNN and hand-crafted features as well as including a re-detection module using an incremental SVM and GM-PHD filter.

Accordingly, to develop an online long-term tracking algorithm robust to appearance variations in both sparse and crowded scenes, we learn two different correlation filters: a translation correlation filter ( $\mathbf{w}_{t}$ ) and a scale correlation filter ( $\mathbf{w}_{s}$ ). A translation correlation filter is learned using a hybrid of multi-layer CNN features from VGG-Net [4] and robust traditional hand-crafted features.

For the CNN part, we combine features from both a lower convolutional layer which retains more spatial details for precise localization and a higher convolutional layer which encodes semantic information for handling appearance variations. These form layer 1, layer 2 and layer 3 of our multi-layer features, with 512, 512 and 256 channels, respectively. Since the spatial resolution of the extracted features gradually reduces with the increase of the depth of CNN layers due to pooling operators, it is crucial to resize each feature map to a fixed size using bilinear interpolation.

For the traditional features part, we use a histogram of oriented gradients (HOG), in particular Felzenszwalb’s variant [44] and color-naming [45] features for capturing image gradients and color information, respectively. These integrated traditional features were used for object detection in [46][47] giving promising results. Color-naming is the linguistic color label assigned by human to describe the color, hence, the mapping method in [45] is employed to convert the RGB space into the color name space which is an 11 dimensional color representation providing the perception of a target color. By aligning the feature size of the HOG variant with 31 dimensions and color-naming with 11 dimensions, they are integrated to make 42 dimensional features which make a 4th layer in our hybrid multi-layer features.

For scale estimation, we learn a scale correlation filter using only HOG features, in particular Felzenszwalb’s variant [44]. Besides, we incorporate a re-detection module by learning an incremental SVM from the most confident frames determined by maximal value of correlation response map using HOG, LUV color and normalized gradient magnitude features for generating high-score detection proposals which are filtered using the GM-PHD filter to re-acquire the target in case of tracking failures. The flowchart of our method is given in Fig. 1 and the outline of our proposed algorithm is given in Algorithm 1.

IV Proposed Algorithm

This section describes our proposed tracking algorithm which has four distinct functional parts: a) correlation filters formulated for multi-layer hybrid features, b) an online SVM detector developed for generating high score detection proposals, c) a GM-PHD filter for finding the detection proposal with maximum weight to re-initialize the tracker in case of tracking failures by removing the other detection proposals as clutter, and d) a scale estimation method for estimating the scale of a target by constructing image pyramid at the estimated target position.

IV-A Correlation Filters for Multi-layer Features

To track a target using correlation filters, the appearance of the target should be modeled using a correlation filter $\mathbf{w}$ which can be trained on a feature vector $\mathbf{x}$ of size $M\times N\times D$ extracted from an image patch, where $M$, $N$, and $D$ indicate the width, height and number of channels, respectively. This feature vector $\mathbf{x}$ can be extracted from multiple layers, for example in the case of CNN features and/or traditional hand-crafted features; therefore, we denote it as $\mathbf{x}^{(l)}$ to designate the layer $l$ from which it is extracted. All the circular shifts of $\mathbf{x}^{(l)}$ along the $M$ and $N$ dimensions are considered as training examples, where each circularly shifted sample $\mathbf{x}^{(l)}_{m,n},m\in\{0,1,...,M-1\},n\in\{0,1,...,N-1\}$ has a Gaussian function label $y^{(l)}(m,n)$ given by

$$y^{(l)}(m,n) = \exp\left(-\frac{(m - M/2)^{2} + (n - N/2)^{2}}{2\sigma^{2}}\right), \tag{1}$$

where $\sigma$ is the kernel width. Hence, $y^{(l)}(m,n)$ is a soft label rather than a binary label. To learn the correlation filter $\mathbf{w}^{(l)}$ for layer $l$ with the same size as $\mathbf{x}^{(l)}$ , we extend a ridge regression [48][49], developed for a single-layer feature vector, to be used for multi-layer hybrid features with layer $l$ as

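$$\min_{\mathbf{w}^{(l)}} \sum_{m,n} \left|\Phi(\mathbf{x}^{(l)}).\mathbf{w}^{(l)} - y^{(l)}(m,n)\right|^{2} + \lambda\left|\mathbf{w}^{(l)}\right|^{2}, \tag{2}$$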

where $\Phi$ denotes the mapping to a kernel space and $\lambda$ is a regularization parameter ( $\lambda\geq 0$ ). The solution $\mathbf{w}^{(l)}$ can be expressed as

$$\mathbf{w}^{(l)} = \sum_{m,n} a^{(l)}(m,n)\,\Phi(\mathbf{x}^{(l)}_{m,n}). \tag{3}$$

This alternative representation makes the dual space $\mathbf{a}^{(l)}$ the variable under optimization instead of the primal space $\mathbf{w}^{(l)}$ .

Training phase: The training phase is performed in the Fourier domain using the fast Fourier transform (FFT) to compute the coefficient $\mathbf{A}^{(l)}$ as

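$$\mathbf{A}^{(l)} = \mathcal{F}(\mathbf{a}^{(l)}) = \frac{\mathcal{F}(\mathbf{y}^{(l)})}{\mathcal{F}\big(\Phi(\mathbf{x}^{(l)}).\Phi(\mathbf{x}^{(l)})\big) + \lambda}, \tag{4}$$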

where $\mathcal{F}$ denotes the FFT operator.
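As a rough illustration of the Gaussian label and the Fourier-domain training step for one layer, here is a minimal NumPy sketch. It assumes a linear kernel, so that the kernel term $\Phi(\mathbf{x}^{(l)}).\Phi(\mathbf{x}^{(l)})$ reduces to a channel-summed circular auto-correlation; the function names are hypothetical, and `sigma` is treated as a width in feature-map cells, which is an assumption about units.

```python
import numpy as np

def gaussian_label(M, N, sigma):
    """Soft label y(m, n) peaked at (M/2, N/2)."""
    m = np.arange(M)[:, None]
    n = np.arange(N)[None, :]
    return np.exp(-((m - M / 2) ** 2 + (n - N / 2) ** 2) / (2 * sigma ** 2))

def linear_kernel_correlation(x_i, x_j):
    """Circular cross-correlation of two (M, N, D) feature maps, summed over channels."""
    X_i = np.fft.fft2(x_i, axes=(0, 1))
    X_j = np.fft.fft2(x_j, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(np.conj(X_i) * X_j, axis=2)))

def train_filter(x, y, lam=1e-4):
    """Dual coefficients A = F(y) / (F(k_xx) + lambda) for one layer."""
    k_xx = linear_kernel_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k_xx) + lam)

# Usage on a toy single-channel patch (a unit impulse), for which F(k_xx) = 1 everywhere.
y = gaussian_label(8, 8, sigma=2.0)
x = np.zeros((8, 8, 1))
x[0, 0, 0] = 1.0
A = train_filter(x, y, lam=1e-4)
# Applying the filter back to the training sample approximately reproduces the label.
r = np.real(np.fft.ifft2(A * np.fft.fft2(linear_kernel_correlation(x, x))))
```

Because everything is element-wise in the Fourier domain, training costs a few FFTs per layer rather than a dense linear solve.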

Detection phase: The detection phase is performed on the new frame given an image patch (search window) which is used as a temporal context i.e. the search window is larger than the target to provide some context. If feature vector $\mathbf{z}^{(l)}$ of size $M\times N\times D$ is extracted from this image patch, the response map $(\mathbf{r}^{(l)})$ is computed as

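$$\mathbf{r}^{(l)} = \mathcal{F}^{-1}\big(\tilde{\mathbf{A}}^{(l)} \odot \mathcal{F}(\Phi(\mathbf{z}^{(l)}).\Phi(\tilde{\mathbf{x}}^{(l)}))\big), \tag{5}$$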

where $\tilde{\mathbf{A}}^{(l)}$ and $\tilde{\mathbf{x}}^{(l)}=\mathcal{F}^{-1}(\tilde{\mathbf{X}}^{(l)})$ denote the learned target appearance model for layer $l$ , operator $\odot$ is the Hadamard (element-wise) product, and $\mathcal{F}^{-1}$ is the inverse FFT. Now, the response maps of all layers are summed according to their weight $\gamma(l)$ element-wise as

$$\mathbf{r}(m,n) = \sum_{l} \gamma(l)\,\mathbf{r}^{(l)}(m,n). \tag{6}$$

The new target position is estimated by finding the maximum value of $\mathbf{r}(m,n)$ as

$$(\hat{m}, \hat{n}) = \underset{m,n}{\operatorname{argmax}}\;\mathbf{r}(m,n). \tag{7}$$
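The fusion of per-layer response maps and the peak search can be sketched as below. The helper name and the layer weights in the usage lines are hypothetical; this section does not state the values of $\gamma(l)$.

```python
import numpy as np

def fuse_and_locate(responses, gammas):
    """Weighted element-wise sum of per-layer response maps, then argmax.

    responses: list of (M, N) arrays, one per layer.
    gammas:    list of per-layer weights gamma(l).
    """
    r = sum(g * r_l for g, r_l in zip(gammas, responses))
    m_hat, n_hat = np.unravel_index(np.argmax(r), r.shape)
    return r, (int(m_hat), int(n_hat))

# Usage: two toy response maps with peaks at different positions.
r1 = np.zeros((4, 5))
r1[1, 2] = 1.0
r2 = np.zeros((4, 5))
r2[3, 4] = 1.0
_, peak_a = fuse_and_locate([r1, r2], [1.0, 0.5])  # layer 1 dominates
_, peak_b = fuse_and_locate([r1, r2], [0.2, 1.0])  # layer 2 dominates
```

The maximum value of the fused map is also the confidence score that the later thresholds on the correlation response are compared against.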

Model update: The model is updated by training a new model at the new target position and then linearly interpolating the obtained values of the dual space coefficients $\mathbf{A}^{(l)}_{k}$ and the base data template $\mathbf{X}^{(l)}_{k}=\mathcal{F}(\mathbf{x}^{(l)}_{k})$ with those from the previous frame to make the tracker more adaptive to target appearance variations.

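$$\tilde{\mathbf{X}}^{(l)}_{k} = (1-\eta)\,\tilde{\mathbf{X}}^{(l)}_{k-1} + \eta\,\mathbf{X}^{(l)}_{k}, \tag{8a}$$

$$\tilde{\mathbf{A}}^{(l)}_{k} = (1-\eta)\,\tilde{\mathbf{A}}^{(l)}_{k-1} + \eta\,\mathbf{A}^{(l)}_{k}, \tag{8b}$$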

where $k$ is the index of the current frame, and $\eta$ is the learning rate.
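To get a feel for the learning rate, assume the update is the usual running average $\tilde{\mathbf{A}}^{(l)}_{k} = (1-\eta)\tilde{\mathbf{A}}^{(l)}_{k-1} + \eta\mathbf{A}^{(l)}_{k}$ with $\tilde{\mathbf{A}}^{(l)}_{1} = \mathbf{A}^{(l)}_{1}$ (the same reasoning applies to the template). Unrolling the recursion gives

$$\tilde{\mathbf{A}}^{(l)}_{k} = (1-\eta)^{k-1}\,\mathbf{A}^{(l)}_{1} + \eta\sum_{j=2}^{k}(1-\eta)^{k-j}\,\mathbf{A}^{(l)}_{j},$$

so older frames are down-weighted geometrically. With $\eta = 0.01$ (Table I), the first-frame model still carries a weight of $0.99^{100}\approx 0.37$ after 100 updates, which means the appearance model changes slowly from frame to frame.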

The mappings to the kernel space $(\Phi)$ used in Eq. (4) and Eq. (5) can be expressed using a kernel function as $K(\mathbf{x}^{(l)}_{i},\mathbf{x}^{(l)}_{j})=\Phi(\mathbf{x}^{(l)}_{i}).\Phi(\mathbf{x}^{(l)}_{j})=\Phi(\mathbf{x}^{(l)}_{i})^{T}\Phi(\mathbf{x}^{(l)}_{j})$ . If the computation is performed in the frequency domain, the normal transpose should be replaced by the Hermitian transpose i.e. $\Phi(\mathbf{X}^{(l)}_{i})^{H}=(\Phi(\mathbf{X}^{(l)}_{i})^{*})^{T}$ where the star ( $*$ ) denotes the complex conjugate.

Thus, for a linear kernel,

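$$K(\mathbf{x}^{(l)}_{i},\mathbf{x}^{(l)}_{j}) = (\mathbf{x}^{(l)}_{i})^{T}\mathbf{x}^{(l)}_{j} = \mathcal{F}^{-1}\Big(\sum_{d}(\mathbf{X}^{(l)}_{i,d})^{*}\odot\mathbf{X}^{(l)}_{j,d}\Big), \tag{9}$$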

where $\mathbf{X}^{(l)}_{i}=\mathcal{F}(\mathbf{x}^{(l)}_{i})$ .

and for a Gaussian kernel,

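$$K(\mathbf{x}^{(l)}_{i},\mathbf{x}^{(l)}_{j}) = \Phi(\mathbf{x}^{(l)}_{i})^{T}\Phi(\mathbf{x}^{(l)}_{j}) = \exp\Big(-\frac{|\mathbf{x}^{(l)}_{i}-\mathbf{x}^{(l)}_{j}|^{2}}{\sigma^{2}}\Big) = \exp\Big(-\frac{1}{\sigma^{2}}\Big(\|\mathbf{x}^{(l)}_{i}\|^{2}+\|\mathbf{x}^{(l)}_{j}\|^{2}-2\,\mathcal{F}^{-1}\Big(\sum_{d}(\mathbf{X}^{(l)}_{i,d})^{*}\odot\mathbf{X}^{(l)}_{j,d}\Big)\Big)\Big). \tag{10}$$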

This formulation is generic for multiple channel features from multiple layers as in the case of our multi-layer hybrid features, i.e. where $\mathbf{X}^{(l)}_{i,d},~{}d\in\{1,...,D\},~{}l\in\{1,...,L\}$ . This is an extended version of the one given in [28] that takes into account features from multiple layers. The linearity of the FFT allows us to simply sum the individual dot-products for each channel $d\in\{1,...,D\}$ in each layer $l\in\{1,...,L\}$ .
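The two kernels can be evaluated for all circular shifts at once with a few FFTs. The sketch below is an illustration under stated assumptions rather than the authors' code: function names are hypothetical, and the Gaussian kernel uses the expansion $|\mathbf{x}_i-\mathbf{x}_j|^2 = \|\mathbf{x}_i\|^2 + \|\mathbf{x}_j\|^2 - 2\,\mathbf{x}_i^{T}\mathbf{x}_j$ with the cross term taken from the channel-summed correlation.

```python
import numpy as np

def cross_correlation(x_i, x_j):
    """Circular cross-correlation of two (M, N, D) feature maps, summed over channels d."""
    X_i = np.fft.fft2(x_i, axes=(0, 1))
    X_j = np.fft.fft2(x_j, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(np.conj(X_i) * X_j, axis=2)))

def linear_kernel(x_i, x_j):
    """Linear kernel evaluated at every circular shift."""
    return cross_correlation(x_i, x_j)

def gaussian_kernel(x_i, x_j, sigma):
    """Gaussian kernel evaluated at every circular shift."""
    sq_dist = np.sum(x_i ** 2) + np.sum(x_j ** 2) - 2.0 * cross_correlation(x_i, x_j)
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma ** 2)

# Usage: compare one entry against a brute-force shifted distance.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 2))
k = gaussian_kernel(x, x, sigma=10.0)
shifted = np.roll(x, (1, 2), axis=(0, 1))
brute = np.exp(-np.sum((x - shifted) ** 2) / 10.0 ** 2)
```

Summing over the channel axis before the inverse FFT is what the linearity argument in the text permits: one inverse transform per layer instead of one per channel.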

IV-B Online Detector

We include a re-detection module, $D_{r}$ , to generate high score detection proposals in case of tracking failures due to long-term occlusions. Instead of using a correlation filter to scan across the entire frame, which is computationally expensive, we learn an incremental (online) SVM [50] by generating a set of samples in the search window around the estimated target position from the most confident frames, and scan through the window when the re-detection is activated to generate high-score detection proposals. These most confident frames are determined by the maximum translation correlation response in the current frame, i.e. if the maximum correlation response of an image patch is above the detector training threshold ( $T_{td}$ ), we generate samples around this image patch and train the detector. The detector is activated to generate high score detection proposals if the maximum of the correlation response falls below the re-detection activation threshold ( $T_{rd}$ ). We use HOG (particularly Felzenszwalb’s variant [44]), LUV color and normalized gradient magnitude features to train this online SVM classifier. For computational feasibility, these visual features differ from the ones we use for learning the translation correlation filter, since we can select the feature representation for each module independently [29, 39].
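The two thresholds act as a simple gate on the peak of the correlation response. A minimal sketch, using the values $T_{rd}=0.15$ and $T_{td}=0.40$ from Table I; the function name and the returned dictionary are hypothetical.

```python
T_RD = 0.15  # re-detection activation threshold T_rd (Table I)
T_TD = 0.40  # detector training threshold T_td (Table I)

def detector_actions(max_response, t_rd=T_RD, t_td=T_TD):
    """Decide what the re-detection module does for the current frame.

    max_response: peak value of the translation correlation response map.
    """
    return {
        # Confident frame: sample around the target and update the online SVM.
        "train_detector": max_response > t_td,
        # Likely tracking failure: scan the search window for proposals.
        "run_redetection": max_response < t_rd,
    }

confident = detector_actions(0.5)
uncertain = detector_actions(0.3)
failed = detector_actions(0.1)
```

Responses between the two thresholds trigger neither action, so the detector is not trained on doubtful frames and is not run while the tracker is still reasonably confident.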

We want to update a weight vector $\mathbf{w}$ of the SVM provided a set of samples with associated labels, $\{(\acute{\mathbf{x}}_{i},\acute{\mathbf{y}}_{i})\}$ , obtained from the current results. The label $\acute{\mathbf{y}}_{i}$ of a new example $\acute{\mathbf{x}}_{i}$ is given by

[TABLE]

where $IOU(\cdot)$ is the intersection over union (overlap ratio) of a new example $\acute{\mathbf{x}}_{i}$ and the estimated target bounding box in the current most confident frame, $\ddot{\mathbf{x}}_{t}$. Samples with bounding box overlap ratios between the thresholds $\delta_{n}$ and $\delta_{p}$ are excluded from the training set to avoid the drift problem.
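A minimal sketch of this labeling rule follows. It assumes boxes are given as (x, y, width, height) tuples and that the boundaries are handled as "at least $\delta_{p}$" for positives and "below $\delta_{n}$" for negatives; both are assumptions, as are the function names. The default thresholds are the values reported in Section V.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_sample(sample_box, target_box, delta_p=0.9, delta_n=0.3):
    """Return +1, -1, or None (sample excluded from the training set)."""
    overlap = iou(sample_box, target_box)
    if overlap >= delta_p:
        return +1
    if overlap < delta_n:
        return -1
    return None  # ambiguous overlap: dropped to limit drift

target = (0.0, 0.0, 10.0, 10.0)
labels = [label_sample(box, target)
          for box in [(0.0, 0.0, 10.0, 10.0),   # identical box
                      (5.0, 0.0, 10.0, 10.0),   # IoU = 1/3
                      (8.0, 0.0, 10.0, 10.0)]]  # IoU = 1/9
```

Discarding the ambiguous middle band keeps partially overlapping patches from being learned as either target or background.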

SVM classifiers of the form $f(\mathbf{x})=\mathbf{w}\cdot\Phi(\mathbf{x})+b$ are learned from the data $\{(\mathbf{x}_{i},\mathbf{y}_{i})\in\Re^{m}\times\{-1,+1\},~\forall i\in\{1,\dots,N\}\}$ by minimizing

[TABLE]

for $p\in\{1,2\}$ subject to the constraints

[TABLE]
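The role of the exponent $p$ can be illustrated with a small sketch of the standard soft-margin objective, $\frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{i}\xi_{i}^{p}$ with $\xi_{i}=\max(0,1-y_{i}f(\mathbf{x}_{i}))$. This textbook form is assumed here and may differ in notation or constants from the objective above; the sketch also assumes a linear feature map, and the function name and toy data are hypothetical.

```python
def svm_objective(w, b, samples, labels, C=2.0, p=1):
    """Soft-margin objective 0.5*||w||^2 + C * sum(slack_i ** p), linear kernel."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    loss = 0.0
    for x, y in zip(samples, labels):
        f = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack = max(0.0, 1.0 - y * f)  # zero when the example is outside the margin
        loss += slack ** p
    return regularizer + C * loss

# Toy 1-D data: the last positive example is an outlier on the wrong side.
samples = [[2.0], [0.5], [-3.0]]
labels = [1, 1, 1]
hinge = svm_objective([1.0], 0.0, samples, labels, p=1)
quadratic = svm_objective([1.0], 0.0, samples, labels, p=2)
```

With this data the single outlier contributes a slack of 4 under $p=1$ but 16 under $p=2$, so it dominates the quadratic objective far more strongly.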

The hinge loss ($p=1$) is preferred over the quadratic loss ($p=2$) due to its improved robustness to outliers. Thus, the offline SVM learns a weight vector $\mathbf{w}=(w_{1},w_{2},\dots,w_{N})^{T}$ by solving this convex quadratic programming (QP) problem, which can be expressed in its dual form as

[TABLE]

where $\{a_{i}\}$ are Lagrange multipliers, $b$ is the bias, $C$ is the regularization parameter, and $Q_{ij}=y_{i}y_{j}K(\mathbf{x}_{i},\mathbf{x}_{j})$. The kernel function $K(\mathbf{x}_{i},\mathbf{x}_{j})=\Phi(\mathbf{x}_{i})\cdot\Phi(\mathbf{x}_{j})$ implicitly maps the samples into a higher-dimensional feature space and computes the dot product there. Conventional QP solvers cannot straightforwardly handle the optimization problem in Eq. (14) for online tracking tasks, because the training data arrive sequentially rather than all at once. The incremental SVM [50] is tailored for such cases: it retains the Karush-Kuhn-Tucker (KKT) conditions on all existing examples while updating the model with a new example, so that the exact solution is guaranteed at each increment of the data set. The KKT conditions are the first-order necessary conditions for the unique optimal solution of the dual parameters $\{a,b\}$ that minimizes Eq. (14), and are given by

[TABLE]

Based on the partial derivative $m_{i}=\frac{\partial W}{\partial a_{i}}$, which is related to the margin of the $i$-th example, each training example falls into one of three categories: $\mathcal{S}_{1}$, support vectors lying on the margin ($m_{i}=0$); $\mathcal{S}_{2}$, support vectors lying inside the margin ($m_{i}<0$); and $\mathcal{R}$, the remaining reserve vectors (non-support vectors). During incremental learning, new examples with $m_{i}\leq 0$ eventually become margin ($\mathcal{S}_{1}$) or error ($\mathcal{S}_{2}$) support vectors. The remaining new training examples become reserve vectors, as they do not enter the solution, and the Lagrange multipliers ($a_{i}$) are estimated while retaining the KKT conditions. Given the updated Lagrange multipliers, the weight vector $\mathbf{w}$ is given by

[TABLE]

It is important to keep only a fixed number of support vectors with the smallest margins for efficiency during online tracking.
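The three-way categorization can be sketched as follows. The sketch assumes the form $m_{i}=y_{i}f(\mathbf{x}_{i})-1$ that is usual in the incremental SVM literature, with $f(\mathbf{x})=\sum_{j}a_{j}y_{j}K(\mathbf{x}_{j},\mathbf{x})+b$; whether this matches the paper's own expression term by term is an assumption. A linear kernel is used for brevity (the paper uses a Gaussian kernel), and all names and the toy solution are hypothetical.

```python
def decision(x, support, b):
    """f(x) = sum_j a_j * y_j * <x_j, x> + b for a linear kernel."""
    return sum(a * y * sum(u * v for u, v in zip(xj, x))
               for a, y, xj in support) + b

def margin_derivative(x, y, support, b):
    """m = y * f(x) - 1, assumed to correspond to dW/da for this example."""
    return y * decision(x, support, b) - 1.0

def categorize(m, tol=1e-9):
    """Map m to the set the example belongs to (or will move to)."""
    if abs(m) <= tol:
        return "margin"   # S_1: on the margin
    if m < 0:
        return "error"    # S_2: inside the margin
    return "reserve"      # R: does not enter the solution

# Toy 1-D solution: multipliers (a, y, x) giving f(x) = x with b = 0.
support = [(0.5, -1, [-1.0]), (0.5, +1, [1.0])]
candidates = [([1.0], +1), ([3.0], +1), ([0.5], +1)]
categories = [categorize(margin_derivative(x, y, support, 0.0))
              for x, y in candidates]
```

Only the candidates that do not come out as reserve vectors require an incremental update of the multipliers; the others can be stored without changing the solution.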

Thus, using the trained incremental SVM, we generate high score detections as detection proposals during the re-detection stage. These are filtered using the GM-PHD filter to find the best possible detection that can re-initialize the tracker.

IV-C Temporal Filtering using the GM-PHD Filter

Once we generate high-score detection proposals using the learned online SVM classifier during the re-detection stage, we need to find the most probable detection proposal for the target state (position) estimate. We do this by selecting the detection proposal with the maximum weight using the GM-PHD filter [8]. Though the GM-PHD filter is designed for multi-target filtering under linear Gaussian assumptions, in our problem (re-detecting a target in a cluttered scene) it is used to remove clutter that comes from the background scene and from other targets not of interest, a capability it is equipped with. In addition, it provides motion information for the tracking algorithm. More importantly, using the GM-PHD filter to find the detection with the maximum weight among the generated high-score detection proposals is more robust than relying only on the maximum score of the classifier.

The detected position of the target in each frame is filtered using the GM-PHD filter, but the position states are not refined until the re-detection module is activated. This updates the weight in the GM-PHD filter corresponding to the target of interest, giving it sufficient prior information to be picked from among the candidate high-score detection proposals during the re-detection stage. If the re-detection module is activated (the correlation response of the target falls below a pre-defined threshold), we generate high-score detection proposals (in this case 5) from the trained SVM classifier, which are then filtered using the GM-PHD filter. The Gaussian component with the maximum weight is selected as the position estimate, and if the correlation response at this estimated position is greater than the pre-defined threshold, the estimated position of the target is refined.

The GM-PHD filter has two steps: prediction and update. Before stating these two steps, certain assumptions are needed: 1) each target follows a linear Gaussian model:

[TABLE]

where $\mathcal{N}(\cdot\,;m,P)$ denotes a Gaussian density with mean $m$ and covariance $P$; $F_{k-1}$ and $H_{k}$ are the state transition and measurement matrices, respectively; and $Q_{k-1}$ and $R_{k}$ are the covariance matrices of the process and measurement noises, respectively. 2) A current-measurement-driven birth intensity, inspired by but not identical to [51], is introduced at each time step with a non-informative zero initial velocity, removing the need for prior knowledge (specification of birth intensities) or a random model. The intensity of the spontaneous birth RFS is a Gaussian mixture of the form

[TABLE]

where $V_{\gamma,k}$ is the number of birth Gaussian components, $w_{\gamma,k}^{(v)}$ is the weight accompanying Gaussian component $v$, $m_{\gamma,k}^{(v)}$ is the mean, formed from the current measurement and a zero initial velocity, and $P_{\gamma,k}^{(v)}$ is the birth covariance of Gaussian component $v$. In our case, $V_{\gamma,k}$ equals 1 except during the re-detection stage, when it becomes 5 because we generate 5 high-score detection proposals to be filtered.

3) The survival and detection probabilities are independent of the target state: $p_{s,k}(x_{k})=p_{s,k}$ and $p_{D,k}(x_{k})=p_{D,k}$.

Prediction: It is assumed that the posterior intensity at time $k-1$ is a Gaussian mixture of the form

[TABLE]

where $V_{k-1}$ is the number of Gaussian components of $\mathcal{D}_{k-1}(x)$, which equals the number of Gaussian components remaining after pruning and merging at the previous iteration. Under these assumptions, the predicted intensity at time $k$ is given by

[TABLE]

where

[TABLE]

where $\gamma_{k}(x)$ is given by Eq. (20).

Since $\mathcal{D}_{S,k|k-1}(x)$ and $\gamma_{k}(x)$ are Gaussian mixtures, $\mathcal{D}_{k|k-1}(x)$ can be expressed as a Gaussian mixture of the form

[TABLE]

where $w_{k|k-1}^{(v)}$ is the weight accompanying the predicted Gaussian component $v$, and $V_{k|k-1}$ is the number of predicted Gaussian components. It equals the number of birth components (1, or 5 in the case of re-detection) plus the number of persistent components, i.e. the number of Gaussian components remaining after pruning and merging at the previous iteration.
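The prediction step can be sketched with the standard GM-PHD recursion of [8]: surviving components have their weights scaled by the survival probability and are propagated through the linear motion model, and one birth component is appended per current measurement with zero initial velocity. For brevity the sketch uses a 1-D position with a [position, velocity] state rather than the image-plane state of the tracker. The values of `p_s`, `birth_weight`, `birth_cov`, `F` and `Q` are hypothetical; none of them is taken from the paper.

```python
def mat_mul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def predict(components, measurements, F, Q, birth_cov, p_s=0.99, birth_weight=0.1):
    """components: list of (weight, mean, covariance); returns predicted mixture."""
    predicted = []
    for w, m, P in components:
        # Persistent component: w <- p_s * w, m <- F m, P <- F P F^T + Q.
        m_pred = [sum(F[i][k] * m[k] for k in range(len(m))) for i in range(len(F))]
        FPFt = mat_mul(mat_mul(F, P), transpose(F))
        P_pred = [[FPFt[i][j] + Q[i][j] for j in range(len(Q))]
                  for i in range(len(Q))]
        predicted.append((p_s * w, m_pred, P_pred))
    for z in measurements:
        # Measurement-driven birth: position from the measurement, zero velocity.
        predicted.append((birth_weight, [z, 0.0], [row[:] for row in birth_cov]))
    return predicted

F = [[1.0, 1.0], [0.0, 1.0]]      # constant-velocity model, unit time step
Q = [[0.25, 0.0], [0.0, 0.25]]    # process noise covariance (hypothetical)
posterior = [(0.9, [10.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])]
predicted = predict(posterior, [12.5], F, Q, birth_cov=[[4.0, 0.0], [0.0, 4.0]])
```

With one persistent component and one measurement, the predicted mixture has two components, matching the count described above.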

Update: The posterior intensity (updated PHD) at time $k$ is also a Gaussian mixture and is given by

[TABLE]

where

[TABLE]

The clutter intensity due to the scene, $c_{s_{k}}(z)$ , in Eq. (24) is given by

[TABLE]

where $c(\cdot)$ is the uniform density over the surveillance region $A$, and $\lambda_{c}$ is the average number of clutter returns per unit volume, i.e. $\lambda_{t}=\lambda_{c}A$. We set the clutter rate, or false positives per image (fppi), to $\lambda_{t}=4$ in our experiments.

After update, weak Gaussian components with weight $w_{k}^{(v)}<T=10^{-5}$ are pruned, and Gaussian components with Mahalanobis distance less than $U=4$ pixels from each other are merged. These pruned and merged Gaussian components are predicted as existing targets in the next iteration. Finally, the Gaussian component of the posterior intensity with mean corresponding to the maximum weight is selected as a target state (position) estimate when the re-detection module is activated.
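The update, pruning and selection steps can be sketched in the same spirit, using the standard GM-PHD update of [8]: each predicted component contributes a missed-detection term, and each measurement produces Kalman-updated components whose weights are normalized by the clutter intensity plus the total detection likelihood. The sketch uses a scalar position-only state with $H=1$ for brevity and omits merging. The detection probability, measurement noise, region size and toy data are hypothetical; only $\lambda_{t}=4$ and $T=10^{-5}$ come from the text.

```python
import math

def gaussian(z, mean, var):
    """Scalar Gaussian density N(z; mean, var)."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def update(predicted, measurements, R, p_d, clutter_intensity):
    """predicted: list of (weight, mean, variance) for a scalar state."""
    updated = [((1.0 - p_d) * w, m, P) for w, m, P in predicted]  # missed detections
    for z in measurements:
        terms = []
        for w, m, P in predicted:
            S = P + R                    # innovation variance
            K = P / S                    # Kalman gain
            terms.append((p_d * w * gaussian(z, m, S), m + K * (z - m), (1.0 - K) * P))
        norm = clutter_intensity + sum(t[0] for t in terms)
        updated.extend((t[0] / norm, t[1], t[2]) for t in terms)
    return updated

def prune(components, T=1e-5):
    """Drop components whose weight is below the pruning threshold."""
    return [c for c in components if c[0] >= T]

def select_estimate(components):
    """Mean of the component with the maximum weight."""
    return max(components, key=lambda c: c[0])[1]

clutter = 4.0 / 640.0   # lambda_t = 4 spread uniformly over a 640-pixel region
predicted = [(0.9, 100.0, 4.0), (0.1, 300.0, 4.0)]
proposals = [101.0, 305.0, 500.0]
kept = prune(update(predicted, proposals, R=1.0, p_d=0.95, clutter_intensity=clutter))
estimate = select_estimate(kept)
```

The proposal near the component with the strong prior weight wins even though all three proposals are treated alike by the update, which is the mechanism that lets the filter reject proposals caused by clutter.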

IV-D Scale Estimation

At the new estimated target position (or the refined target position after re-detection in case of tracking failure), we construct an image pyramid for estimating the target scale. Given a target size of $P\times Q$ in a test frame, we generate $S$ scale levels at the new estimated position, i.e. for each $n\in\{\lfloor-\frac{S-1}{2}\rfloor,\lfloor-\frac{S-3}{2}\rfloor,\dots,\lfloor\frac{S-1}{2}\rfloor\}$ we extract an image patch $I_{s}$ of size $sP\times sQ$ centered at the new estimated target position, where the scale is $s=a^{n}$ and $a$ is the scale factor between adjacent pyramid levels. Unlike [29], we uniformly resize all the generated patches back to $P\times Q$, and extract HOG features (particularly Felzenszwalb’s variant [44]) to construct the scale feature pyramid. The optimal scale $\hat{s}$ of the target at the new estimated position is then obtained by computing the correlation response maps $\hat{\mathbf{r}}_{s}$ of the scale correlation filter $\mathbf{w}_{s}$ on $I_{s}$ and finding the scale at which the response is maximal:

[TABLE]

The scale correlation filter is updated using the new training sample at the estimated scale $I_{\hat{s}}$ by Eq. (8).
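The scale search can be sketched as follows, assuming an odd number of scale levels and the values $S=31$ and $a=1.04$ reported in Section V. The function names and the per-scale responses are hypothetical; in the tracker the responses would come from the scale correlation filter.

```python
def scale_factors(S=31, a=1.04):
    """Scales s = a**n for n = -(S-1)/2, ..., (S-1)/2 (S assumed odd)."""
    half = (S - 1) // 2
    return [a ** n for n in range(-half, half + 1)]

def patch_sizes(P, Q, scales):
    """Size s*P x s*Q of the patch extracted at each scale, before resizing."""
    return [(s * P, s * Q) for s in scales]

def best_scale(scales, max_responses):
    """Scale whose correlation response map has the largest maximum."""
    idx = max(range(len(scales)), key=lambda i: max_responses[i])
    return scales[idx]

scales = scale_factors()
sizes = patch_sizes(50, 40, scales)
# Hypothetical per-scale maxima, peaked two levels above the current scale.
responses = [1.0 / (1.0 + abs(i - 17)) for i in range(len(scales))]
s_hat = best_scale(scales, responses)
```

The middle level corresponds to $s=1$, so a peak two levels above it indicates that the target has grown by a factor of $a^{2}$.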

V Implementation Details

The main steps of our proposed algorithm are presented in Algorithm 1. Further implementation details and parameter settings are as follows. For learning the translation correlation filter, we extract features from VGG-Net [4], shown in Fig. 2, pre-trained on a large object recognition data set (ImageNet) [52], after first removing the fully-connected layers. In particular, we use the outputs of the conv3-4, conv4-4 and conv5-4 convolutional layers as features ($l\in\{1,2,3\}$ and $d\in\{1,\dots,D\}$); that is, we take the outputs of the rectified linear unit layers (the inputs of the pooling layers) to retain more spatial resolution. Hence, the CNN features we use have 3 layers ($L=3$) and multiple channels: $D=512$ for the conv5-4 and conv4-4 layers and $D=256$ for the conv3-4 layer. For hand-crafted features, the HOG variant with 31 dimensions and color-naming with 11 dimensions are concatenated into 42-dimensional features, which form a 4th layer in our hybrid multi-layer features. Given an image frame with a search window of size $\tilde{M}\times\tilde{N}$, which is about 2.8 times the target size to provide some context, we resize the multi-layer hybrid features to a fixed spatial size of $M\times N$, where $M=\frac{\tilde{M}}{4}$ and $N=\frac{\tilde{N}}{4}$. The hybrid features from each layer are weighted by a cosine window [28] to remove boundary discontinuities, and are later combined in Eq. (6), for which we set $\gamma$ to 1, 0.4, 0.02 and 0.1 for the conv5-4, conv4-4, conv3-4 and hand-crafted features, respectively. We set the regularization parameter of the ridge regression in Eq. (2) to $\lambda=10^{-4}$, and the kernel bandwidth of the Gaussian label function in Eq. (1) to $\sigma=0.1$. The learning rate for the model update in Eq. (8) is set to $\eta=0.01$.
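The feature-map sizing and cosine windowing can be sketched as below. The sketch assumes that sizes are rounded to the nearest integer and that the cosine window is a separable Hann window with zero-valued end points; both are assumptions, since the exact rounding and window convention are not stated. The function names are hypothetical.

```python
import math

def feature_map_size(target_h, target_w, padding=2.8, stride=4):
    """Search window is about 2.8x the target; features are resized to 1/4 of it."""
    search_h = round(target_h * padding)
    search_w = round(target_w * padding)
    return round(search_h / stride), round(search_w / stride)

def hann(n):
    """1-D Hann window with zero end points (assumed convention)."""
    if n == 1:
        return [1.0]
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1))) for i in range(n)]

def cosine_window(M, N):
    """Separable 2-D cosine window as the outer product of two Hann windows."""
    hm, hn = hann(M), hann(N)
    return [[a * b for b in hn] for a in hm]

def apply_window(channel, window):
    """Element-wise weighting of one M x N feature channel."""
    return [[v * w for v, w in zip(row, wrow)] for row, wrow in zip(channel, window)]

M, N = feature_map_size(50, 40)
window = cosine_window(5, 5)
weighted = apply_window([[1.0] * 5 for _ in range(5)], window)
```

The window tapers each channel smoothly to zero at the borders, which suppresses the artificial edges introduced by the periodic extension that the Fourier-domain computation implies.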

For learning the scale correlation filter, we use the same parameter settings as above, with the following exceptions. We use HOG features [44] with 31 bins, treated as a single layer ($L=1$) with multiple channels ($D=31$). The number of scale levels is set to $S=31$ and the scale factor to $a=1.04$. We use a linear kernel, Eq. (9), for learning both the translation and scale correlation filters.

HOG, LUV color and normalized gradient magnitude features are used to train an incremental (online) SVM classifier for the re-detection module. For the objective function given in Eq. (14), we use a Gaussian kernel for $Q_{ij}=y_{i}y_{j}K(\mathbf{x}_{i},\mathbf{x}_{j})$, and the regularization parameter $C$ is set to 2. Empirically, we set the activated re-detection threshold to $T_{rd}=0.15$ and the trained detector threshold to $T_{td}=0.40$. The parameters in Eq. (11) are set to $\delta_{p}=0.9$ and $\delta_{n}=0.3$. For negative samples, we randomly sample 3 times the number of positive samples, satisfying $\delta_{n}=0.3$, within a maximum search area of 4 times the target size. In the re-detection phase, we generate 5 high-score detection proposals from the trained online SVM around the estimated position, within a maximum search area of 6 times the target size; these are filtered using the GM-PHD filter to find the detection with the maximum weight, and the others are removed as clutter. The implementation parameters are summarized in Table I.

VI Experimental Results

We evaluate our proposed tracking algorithm on both a large-scale online object tracking benchmark (OOTB) [22] and crowded scenes (the medium and dense PETS 2009 data sets, http://www.cvg.reading.ac.uk/PETS2009/a.html), and compare its performance with state-of-the-art trackers using the same parameter values for all sequences. We quantitatively evaluate the robustness of the trackers using two metrics, precision and success rate, based on center location error and bounding box overlap ratio, respectively. We use the one-pass evaluation (OPE) setting, running each tracker throughout a test sequence with initialization from the ground truth position in the first frame. The center location error is the average Euclidean distance between the center locations of the tracked targets and the manually labeled ground truth positions over all frames, whereas the bounding box overlap ratio is the intersection over union of the tracked target and ground truth bounding boxes.
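The two scores can be sketched as follows. The sketch assumes boxes given as (x, y, width, height), a precision score defined as the fraction of frames whose center error is at most 20 pixels, and a success score defined as the mean success rate over 21 evenly spaced overlap thresholds in [0, 1]. These are the conventions commonly used with this benchmark and are assumed rather than taken from the text; the function names and toy data are hypothetical.

```python
import math

def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    return math.hypot((a[0] + a[2] / 2.0) - (b[0] + b[2] / 2.0),
                      (a[1] + a[3] / 2.0) - (b[1] + b[3] / 2.0))

def overlap(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    union = a[2] * a[3] + b[2] * b[3] - iw * ih
    return iw * ih / union if union > 0 else 0.0

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if center_error(p, g) <= threshold)
    return hits / len(gt)

def success_auc(pred, gt, steps=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = [overlap(p, g) for p, g in zip(pred, gt)]
    rates = []
    for i in range(steps):
        t = i / (steps - 1)
        rates.append(sum(1 for o in overlaps if o > t) / len(overlaps))
    return sum(rates) / steps

gt = [(0.0, 0.0, 10.0, 10.0)] * 3
pred = [(0.0, 0.0, 10.0, 10.0), (3.0, 4.0, 10.0, 10.0), (30.0, 0.0, 10.0, 10.0)]
precision = precision_at(pred, gt)
auc = success_auc(pred, gt)
```

In this toy sequence the second frame is counted as a hit by the precision score (center error of 5 pixels) but contributes little to the success score, which shows why the two metrics can rank trackers differently.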

Our proposed tracking algorithm is implemented in MATLAB on a 3.0 GHz Intel Xeon CPU E5-1607 with 16 GB RAM. We also use the MatConvNet toolbox [53] for CNN feature extraction, with its forward propagation computation transferred to an NVIDIA Quadro K5000; our tracker runs at 5 fps in this setting. The re-detection step and the forward propagation for feature extraction are the main computational loads of our tracking algorithm. We analyze our algorithm and then compare it with the state-of-the-art trackers, both quantitatively and qualitatively, on the OOTB and PETS 2009 data sets separately as follows.

VI-A Evaluation on OOTB

OOTB [22] contains 50 fully annotated videos with substantial variations such as scale, occlusion, illumination, etc and is currently a popular tracking benchmark available in the computer vision community. In this experiment, we compare our proposed tracking algorithm with 6 state-of-the-art trackers including CF2 [36], LCT [39], MEEM [21], DLT [5], KCF [28] and SAMF [47], as well as 4 more top trackers included in the Benchmark [22], particularly SCM [26], ASLA [14], TLD [18] and Struck [20] both quantitatively and qualitatively.

Quantitative Evaluation: We evaluate our proposed tracking algorithm quantitatively and compare it with other algorithms, as summarized in Fig. 3, using precision plots (left) and success plots (right) based on center location error and bounding box overlap ratio, respectively. Our proposed tracking algorithm, denoted by LCMHT, outperforms the state-of-the-art trackers in both precision and success measures. The rankings given in the legends use the distance precision score at a threshold of 20 pixels and the overlap success area-under-curve (AUC) score for each tracker, respectively. This is because a hybrid of multi-layer CNN, HOG and color-naming features represents the target more effectively than the individual features do separately: our proposed tracking algorithm integrates a hybrid of multi-layer CNN and traditional (HOG and color-naming) features for learning the translation correlation filter, and uses the GM-PHD filter to temporally filter the high-score detection proposals generated during a re-detection phase, removing clutter so that it can re-detect the target even in a cluttered environment.

Attribute-based Evaluation: For a detailed performance analysis of each tracker, we also report results on the various challenge attributes in OOTB [22], such as occlusion, scale variation, illumination variation, etc. As shown in Fig. 4, our proposed tracker outperforms the state-of-the-art trackers on almost all challenge attributes. In particular, our proposed tracker (LCMHT) performs significantly better than all trackers on the occlusion attribute, since it includes a re-detection module that can re-acquire the target when the tracker fails, even in cluttered environments, by removing clutter using the GM-PHD filter. Similarly, our tracker outperforms the other trackers on the scale variation attribute, since it estimates the scale of the target at the newly estimated target positions. The LCT algorithm includes both re-detection and scale estimation modules; however, our proposed tracker still outperforms it by a large margin, as shown in Fig. 4, since our tracker uses better visual features for translation estimation and re-detection. Furthermore, our proposed algorithm applies scale estimation after the translation and re-detection steps (if activated) rather than only after the translation estimation step as in the LCT algorithm, though both methods use similar visual features (HOG) to learn the scale correlation filter.

Qualitative Evaluation: We qualitatively compare our proposed tracking algorithm (LCMHT) with four other state-of-the-art trackers, namely CF2 [36], MEEM [21], LCT [39] and KCF [28], on some challenging sequences of OOTB, as shown in Fig. 5. CF2 uses hierarchical CNN features but is not as effective as our tracker, which combines hierarchical CNN features with HOG and color-naming traditional features, as can be observed on the sequence Fleetface (first column in Fig. 5). LCT and KCF also use correlation filters with traditional features, but they are still not as accurate as our tracker. MEEM uses many classifiers together to re-initialize the tracker in case of tracking failures, but it cannot re-detect the target on this sequence. Similarly, it cannot re-detect the target on the sequences Singer1 (second column), Freeman4 (third column) and Walking2 (fourth column). LCT includes re-detection and scale estimation components; however, it cannot handle large scale changes as in sequence Singer1 (second column), and it cannot re-initialize the tracker as in sequence Walking2 (fourth column). More importantly, the target in the sequence Freeman4 undergoes not only heavy occlusion in a cluttered environment but also scale variation and in-plane and out-of-plane rotations. The LCT algorithm, although equipped with both re-detection and scale estimation modules, is no more effective on this sequence than the other algorithms. Only our proposed tracker tracks the target until the end of the sequence, both handling the scale change and re-detecting the target after failure. This sequence is a typical example related to our next evaluation on the PETS 2009 data sets, on which our proposed algorithm outperforms the other trackers by a large margin.

VI-B Evaluation on PETS 2009 Data Sets

We label the upper part (head + neck) of representative targets in both medium and dense PETS 2009 data sets to analyze our proposed tracking algorithm. In this experiment, our goal is to analyze our proposed tracking algorithm and other available state-of-the-art tracking algorithms to see whether they can successfully be applied for tracking a target of interest in occluded and cluttered environments. Accordingly, we compare our proposed tracking algorithm with 6 state-of-the-art trackers including CF2 [36], LCT [39], MEEM [21], DSST [29], KCF [28] and SAMF [47], as well as 4 more top trackers included in the Benchmark [22], particularly SCM [26], ASLA [14], CSK [27] and IVT [15] both quantitatively and qualitatively.

Quantitative Evaluation: The precision plots (left) and success plots (right), based on center location error and bounding box overlap ratio, respectively, are shown in Fig. 6. Our proposed tracking algorithm, denoted by LCMHT, outperforms the state-of-the-art trackers by a large margin on the PETS 2009 data sets in both precision and success rate measures. The rankings given in the legends use the distance precision score at a threshold of 20 pixels and the overlap success AUC score for each tracker.

The second and third ranked trackers on the PETS 2009 data sets are CF2 [36] and MEEM [21], respectively, for the precision plots, and vice versa for the success plots. On OOTB, however, CF2 outperforms MEEM significantly, being second to our proposed tracking algorithm. The most notable result is the performance of LCT [39]. This algorithm is ranked third on OOTB, as shown in Fig. 3; however, it performs worst on the precision plots and second from last on the success plots on the PETS 2009 data sets. This is surprising, since the algorithm learns three different discriminative correlation filters and even includes a re-detection module for long-term tracking problems. Though it performs reasonably on OOTB, its performance in occluded and cluttered environments such as the PETS 2009 data sets is poor because the visual features it uses are less robust in such environments. Even CF2, which uses CNN features, has low performance compared to our proposed algorithm on the PETS 2009 data sets. Since our proposed tracking algorithm integrates a hybrid of multi-layer CNN and traditional features for learning the translation correlation filter, and the GM-PHD filter for temporally filtering the high-score detection proposals generated during a re-detection phase to remove clutter, it outperforms all the available trackers significantly. This closes the model-free tracking research gap between sparse and crowded environments.

Qualitative Evaluation: Fig. 7 qualitatively compares our proposed tracker with four representative state-of-the-art trackers: CF2 [36], MEEM [21], LCT [39], and KCF [28]. On the medium density PETS 2009 data set (left column), LCT and KCF lose the target within the first 16 frames. Though the CF2 and MEEM trackers track the target well, they cannot re-detect it after the occlusion; only our proposed tracking algorithm tracks the target until the end of the sequence, by re-initializing the tracker after the occlusion. We show the cropped and enlarged re-detection just after the occlusion in Fig. 8. On the dense PETS data set (right column), all trackers track the target over the first 20 frames, but LCT and KCF lose the target before frame 73. As on the medium density PETS data set, the CF2 and MEEM trackers track the target until they lose it due to occlusion. Only our proposed tracking algorithm, LCMHT, re-detects the target and tracks it until the end of the sequence in such dense environments, for two reasons. First, it incorporates both lower and higher CNN layers in combination with traditional features (HOG and color-naming) in a multi-layer representation to learn a translation correlation filter that is robust to appearance variations of targets. Second, it includes a re-detection module which generates high-score detection proposals during a re-detection phase and then filters them using the GM-PHD filter to remove clutter due to the background and other targets not of interest, so that it can re-detect the target in such cluttered and dense environments. These make our proposed tracking algorithm outperform the other state-of-the-art trackers.

VII Conclusions

We have developed a novel long-term visual tracking algorithm that learns discriminative correlation filters and an incremental SVM classifier, and that can be applied to tracking a target of interest in sparse as well as crowded environments. We learn two different discriminative correlation filters: translation and scale correlation filters. For the translation correlation filter, we combine multi-layer CNN features, pre-trained on a large object recognition data set (ImageNet), with traditional (HOG and color-naming) features in proper proportion. For the CNN part, we combine the advantages of both lower and higher convolutional layers to capture spatial details for precise localization and semantic information for handling appearance variations, respectively. We also include a re-detection module using HOG, LUV color and normalized gradient magnitude features for re-initializing the tracker in case of tracking failures due to long-term occlusions, by training an incremental SVM from the most confident frames. The re-detection module generates high-score detection proposals which are temporally filtered using a GM-PHD filter to remove clutter. The Gaussian component with the maximum weight is selected as the state estimate, which refines the object location when the re-detection module is activated. For the scale correlation filter, we use HOG features to construct a target pyramid around the estimated or re-detected position for estimating the scale of the target. Extensive experiments on both OOTB and PETS 2009 data sets show that our proposed algorithm significantly outperforms state-of-the-art trackers: by 3.48% in distance precision and 7.77% in overlap success on sparse (OOTB) data sets, and by 36.87% in distance precision and 34.92% in overlap success on dense (PETS 2009) data sets. We conclude that learning correlation filters using an appropriate combination of CNN and traditional features, together with a re-detection module based on an incremental SVM and a GM-PHD filter, can give better results than many existing approaches.

Acknowledgment

We would like to acknowledge the support of the Engineering and Physical Sciences Research Council (EPSRC), grant references EP/K009931, EP/J015180 and a James Watt Scholarship.

Bibliography


[1] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual tracking: An experimental survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1442–1468, July 2014.
[2] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, “Recent advances and trends in visual tracking: A review,” Neurocomputing, vol. 74, no. 18, pp. 3823–3831, 2011.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition, 2014.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR, 2015.
[5] N. Wang and D.-Y. Yeung, “Learning a deep compact image representation for visual tracking,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, pp. 809–817.
[6] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discriminative saliency map with convolutional neural network,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015.
[7] Z. Chen, Z. Hong, and D. Tao, “An experimental survey on correlation filter-based tracking,” CoRR, vol. abs/1509.05520, 2015.
[8] B.-N. Vo and W.-K. Ma, “The Gaussian mixture probability hypothesis density filter,” Signal Processing, IEEE Transactions on, vol. 54, no. 11, pp. 4091–4104, Nov 2006.