Learning Cascaded Siamese Networks for High Performance Visual Tracking
Peng Gao, Yipeng Ma, Ruyue Yuan, Liyi Xiao, Fei Wang

TL;DR
This paper introduces a cascaded Siamese network for visual tracking that combines offline-trained matching and online-updated classification subnetworks, achieving high accuracy in challenging scenarios.
Contribution
The novel cascaded Siamese network architecture integrates matching and classification subnetworks with an effective update method, advancing visual tracking performance.
Findings
Achieves state-of-the-art results on benchmark datasets.
Effectively handles negative scenarios in visual tracking.
Online classification subnetwork improves target-specific adaptation.
Taxonomy
Topics: Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection · Image Enhancement Techniques
Methods: Siamese Network
Abstract
Visual tracking is one of the most challenging computer vision problems. In order to achieve high performance visual tracking in various negative scenarios, a novel cascaded Siamese network is proposed and developed based on two different deep learning networks: a matching subnetwork and a classification subnetwork. The matching subnetwork is a fully convolutional Siamese network. According to the similarity score between the exemplar image and the candidate image, it aims to search possible object positions and crop scaled candidate patches. The classification subnetwork is designed to further evaluate the cropped candidate patches and determine the optimal tracking results based on the classification score. The matching subnetwork is trained offline and fixed online, while the classification subnetwork performs stochastic gradient descent online to learn more target-specific information. To improve the tracking performance further, an effective classification subnetwork update method based on both similarity and classification scores is utilized for updating the classification subnetwork. Extensive experimental results demonstrate that our proposed approach achieves state-of-the-art performance in recent benchmarks.
**Index Terms:** Visual tracking, object detection, Siamese networks, cascaded learning
1 Introduction
Visual tracking is one of the most fundamental research problems in computer vision, and it is widely used in numerous applications, such as video surveillance, drone tracking, self-driving vehicles, human-computer interaction, auxiliary medical diagnosis, and many others [1, 2]. Normally, the tracking task is to estimate the trajectory of an arbitrary target in an image sequence, given only its initial location in the first frame. Despite the excellent results achieved by numerous tracking approaches [3, 4, 5, 6, 7, 8] in the past decades, visual tracking is still a challenging problem owing to complicating factors such as fast motion, background clutter, motion blur, deformation, illumination variation, low resolution, occlusion, out-of-view targets, and scale variation.
In recent years, with the rapid development of deep learning [11, 12, 13, 14], convolutional neural networks (CNNs) have attracted increasing attention in the tracking community. Compared with conventional trackers based on handcrafted features [3, 4, 15, 16, 17], CNN-based trackers [18, 19, 20, 5, 21, 22] achieve more competitive performance on multiple benchmarks [23, 10, 24]. In general, existing CNN-based tracking approaches can be divided into two categories: matching-based trackers and classification-based trackers. The former are typically pre-trained offline on the ImageNet video object detection dataset [25]. During tracking, they match the candidates with the exemplar by correlating deep features and do not need online updating. In contrast, classification-based trackers transfer a pre-trained network as the classifier and then perform online updating by adding some particular layers [5]. Although the CNN-based trackers mentioned above have obtained impressive tracking results, there is still considerable room to improve performance.
In this paper, we propose a novel cascaded Siamese network for high performance visual tracking that integrates both matching and classification networks. First, a matching subnetwork measures the similarity between the candidate image and the exemplar image and crops scaled candidate patches based on the similarity score. Then, a classification subnetwork, cascaded with the matching subnetwork, learns a target-specific classification scheme online and uses the classification score to determine the optimal tracking result among all scaled candidate patches. Finally, the similarity and classification scores are combined to decide whether the classification subnetwork should be updated online.
Our main contributions are threefold:
- We propose a novel cascaded Siamese network for high performance visual tracking, which consists of a matching subnetwork and a classification subnetwork.
- We utilize an effective model update method to determine whether the classification subnetwork needs to be updated online.
- We conduct extensive experiments on several recent tracking benchmarks, in which our proposed approach achieves strong performance in terms of both accuracy and robustness, as shown in Fig. 1.
2 Algorithmic Overview
The overall framework of our proposed approach is shown in Fig. 2. It consists of a matching subnetwork, which localizes the target and creates scaled candidate patches, and a classification subnetwork, which determines the optimal tracking result. During tracking, an exemplar image x and a candidate image z, both centered around the previous position of the target, are first fed into the matching subnetwork. The matching subnetwork follows the fully-convolutional Siamese architecture [6], and the similarity between the exemplar image and the candidate image is estimated by cross-correlating their deep features. Then, the possible target positions are chosen by searching for the highest similarity scores, and scaled candidate patches centered at all possible target positions are cropped from the candidate image. Here, the scaling method is similar to that of the DSST tracker [16]. Next, the scaled candidate patches are resized to the input size of the classification subnetwork and classified as foreground or background, and the patch with the highest foreground score is taken as the optimal tracking result. Finally, we update the classification subnetwork online based on the combination of the similarity and classification scores of the optimal tracking result.
3 The proposed approach
3.1 Matching Subnetwork
In our matching subnetwork, we adopt as the deep feature extractor a fully-convolutional Siamese network [6] that is pre-trained offline, in an end-to-end manner, on a large video object detection dataset [25]. Our aim is to learn a function $f(x, z) = g(\varphi(x), \varphi(z))$ that compares the exemplar image $x$ with a candidate image $z$ of the same size, where $\varphi(x)$ and $\varphi(z)$ represent the deep feature maps and $g$ is a similarity metric. We use a cross-correlation layer to measure the similarity between the output deep features,

$$f(x, z) = \varphi(x) \star \varphi(z) + b$$

where $\star$ denotes the cross-correlation operation and $b$ is the bias. Thus, the output $f(x, z)$ is a similarity score map of the exemplar image compared with the candidate image.
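The cross-correlation step can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes single-channel feature maps stored as nested Python lists, whereas a real tracker would correlate multi-channel deep features on a GPU. The function name `cross_correlate` and the toy inputs are hypothetical.

```python
def cross_correlate(exemplar_feat, candidate_feat, bias=0.0):
    """Slide the exemplar feature map over the candidate feature map.

    Both inputs are single-channel 2-D lists of floats (an assumption made
    for illustration); the result is a 2-D similarity score map.
    """
    eh, ew = len(exemplar_feat), len(exemplar_feat[0])
    ch, cw = len(candidate_feat), len(candidate_feat[0])
    out_h, out_w = ch - eh + 1, cw - ew + 1
    score_map = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            for u in range(eh):
                for v in range(ew):
                    total += exemplar_feat[u][v] * candidate_feat[i + u][j + v]
            score_map[i][j] = total + bias
    return score_map


# Toy example: the 2x2 exemplar pattern sits at row 1, column 2 of the candidate.
exemplar = [[1.0, 2.0],
            [3.0, 4.0]]
candidate = [[0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 2.0],
             [0.0, 0.0, 3.0, 4.0]]
scores = cross_correlate(exemplar, candidate)
```

In this toy case the highest value of `scores` lies at the offset where the exemplar pattern appears in the candidate, which is the behaviour the score map relies on for localization.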
The location of the target can be estimated at the highest peak of the similarity score map. However, since a video stream often undergoes variations such as fast motion, illumination variation and occlusion, the similarity measurement may be disturbed by similar objects or background noise in the candidate image, as shown in Fig. 2. As a result, there may be multiple peaks on the similarity score map, and the target may be located at any one of them. Estimating the target at a wrong peak leads to inaccurate localization and tracking drift. To solve this problem, we use the classification subnetwork to further determine both the optimal target position and size among all the peaks.
3.2 Classification Subnetwork
In Section 3.1, we obtain a similarity score map by cross-correlating the output deep features of the feature extractor. Since the similarity score map may not be reliable enough, we treat every peak whose score, relative to that of the highest peak, exceeds a certain threshold as a possible target position. The patches centered at these positions are cropped and scaled as described in Section 2, which yields a series of scaled candidate patches. We then use a classification subnetwork to determine the optimal tracking result among them.
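As a rough sketch of the peak selection described above, the function below keeps the local maxima of a score map whose value is at least a given fraction of the global maximum. The paper does not specify how peaks are detected, so the 8-neighbour local-maximum test, the function name `candidate_peaks`, and the assumption of positive scores are all illustrative choices; the default ratio of 0.75 is taken from the thresholds listed in Section 4.1.

```python
def candidate_peaks(score_map, ratio_threshold=0.75):
    """Return (row, col, score) for local maxima close to the global peak.

    Assumes the global maximum is positive. A cell counts as a peak when it
    is strictly greater than all of its (up to 8) neighbours.
    """
    rows, cols = len(score_map), len(score_map[0])
    best = max(max(row) for row in score_map)
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = score_map[r][c]
            neighbours = [
                score_map[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            is_local_max = all(value > n for n in neighbours)
            if is_local_max and value >= ratio_threshold * best:
                peaks.append((r, c, value))
    # Strongest peak first.
    return sorted(peaks, key=lambda p: p[2], reverse=True)


toy_map = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 1.0, 0.2, 0.1, 0.3],
    [0.1, 0.2, 0.1, 0.3, 0.8],
    [0.0, 0.1, 0.0, 0.2, 0.3],
]
peaks = candidate_peaks(toy_map)
```

With this toy map, the distractor peak of 0.8 is kept alongside the main peak of 1.0, so both positions would be passed on to the classification subnetwork.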
The classification subnetwork architecture is similar to that of MDNet [5] which has three convolutional layers, two fully connected layers and a binary classification layer with softmax cross-entropy loss to output the probabilities of target and background classes, as shown in Fig. 2.
Finally, the candidate patch with the highest classification score in the target class will be selected as the optimal tracking result.
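The final selection step can be sketched as a two-class softmax followed by an argmax over the candidate patches. This is a minimal illustration, not the authors' code: it assumes each patch has already been reduced to a pair of raw scores (background, target) by the classification layer, and the helper names are hypothetical.

```python
import math


def foreground_probability(background_logit, target_logit):
    """Two-class softmax probability of the target class."""
    m = max(background_logit, target_logit)  # subtract the max for stability
    e_bg = math.exp(background_logit - m)
    e_fg = math.exp(target_logit - m)
    return e_fg / (e_bg + e_fg)


def select_best_patch(logits_per_patch):
    """Return (index, probability) of the patch most likely to be the target."""
    probs = [foreground_probability(bg, fg) for bg, fg in logits_per_patch]
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return best_index, probs[best_index]


# Hypothetical (background, target) scores for three candidate patches.
patch_logits = [(2.0, 0.5), (0.0, 3.0), (1.0, 1.0)]
best_index, best_prob = select_best_patch(patch_logits)
```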
3.3 Updating Method
During tracking, the parameters of the matching subnetwork are fixed, while the classification layer and the fully connected layers of the classification subnetwork are fine-tuned online to adapt to appearance variations, based on the optimal tracking result in the current frame. However, the optimal tracking results are not always reliable enough for updating the classification subnetwork, and inappropriate updates based on ambiguous tracking results may corrupt it.
To alleviate this issue, we utilize a simple but effective method for updating the classification subnetwork. Consider the similarity score and the classification score of the current optimal tracking result, together with the corresponding historical scores from previous frames. If no other peak on the similarity score map exceeds a given ratio of the highest peak value, the classification subnetwork is updated directly based on the current optimal tracking result. In contrast, if one or more peaks exceed that ratio, we compare both the similarity and classification scores with the historical scores. Only when each of the two scores is greater than a given fraction of its corresponding historical score do we update the last three layers of our classification subnetwork.
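The update rule above can be expressed as a short decision function. This is a hedged sketch rather than the authors' implementation: the function name and arguments are hypothetical, the historical score is assumed to be the mean over the stored previous frames (the aggregation is not recoverable from the text), and the default fractions 0.8 and 0.6 assume that the thresholds reported in Section 4.1 apply to the similarity and classification scores in that order.

```python
def should_update(num_competing_peaks, similarity, classification,
                  similarity_history, classification_history,
                  similarity_fraction=0.8, classification_fraction=0.6):
    """Decide whether to fine-tune the classification subnetwork.

    num_competing_peaks counts the peaks, other than the highest one, that
    exceed the peak-ratio threshold on the similarity score map.
    """
    # No distractor peaks: the result is unambiguous, so update directly.
    if num_competing_peaks == 0:
        return True
    # Assumption: without any history there is nothing to compare against,
    # so the ambiguous frame is skipped.
    if not similarity_history or not classification_history:
        return False
    # Assumption: the historical score is the mean over previous frames.
    hist_sim = sum(similarity_history) / len(similarity_history)
    hist_cls = sum(classification_history) / len(classification_history)
    return (similarity > similarity_fraction * hist_sim
            and classification > classification_fraction * hist_cls)


# Ambiguous frame (one distractor peak) with scores close to the history.
decision = should_update(1, 0.9, 0.7, [1.0, 1.0], [1.0, 1.0])
```

Requiring both scores to pass keeps a frame out of the update when either the matching or the classification evidence has dropped noticeably relative to recent frames.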
4 Experiments
In this section, we conduct extensive experiments to validate the effectiveness of our proposed cascaded Siamese network. We first detail the implementation of our approach. Then, we investigate the impact of the matching and classification subnetworks as well as the update method. Finally, we compare our approach with nine state-of-the-art trackers, namely ECO [7], CCOT [20], MLCFT [9], CACT [4], Staple [17], MDNet [5], SiamFC [6], KCF [3] and DSST [16], on three tracking benchmarks: OTB-2013 [23], OTB-2015 [10] and VOT-2016 [24]. The experiments on the OTB benchmarks use two metrics, distance precision and overlap success rate, while the expected average overlap (EAO) is used on the VOT dataset.
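For readers unfamiliar with the two OTB metrics, the sketch below computes them for boxes given as `(x, y, width, height)`. The thresholds are assumptions based on common OTB practice (a 20-pixel center error for precision, and a single overlap threshold of 0.5 here, whereas success plots sweep the threshold from 0 to 1); the paper itself does not state them, and the function names are hypothetical.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return math.hypot(ax - bx, ay - by)


def overlap(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def distance_precision(predictions, ground_truth, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    hits = sum(center_error(p, g) <= threshold
               for p, g in zip(predictions, ground_truth))
    return hits / len(predictions)


def success_rate(predictions, ground_truth, threshold=0.5):
    """Fraction of frames whose overlap exceeds the threshold."""
    hits = sum(overlap(p, g) > threshold
               for p, g in zip(predictions, ground_truth))
    return hits / len(predictions)


# Three toy frames: a perfect hit, a partial overlap, and a complete miss.
predicted = [(0, 0, 10, 10), (5, 5, 10, 10), (100, 100, 10, 10)]
truth = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
precision = distance_precision(predicted, truth)
success = success_rate(predicted, truth)
```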
4.1 Implementation Details
Network Architecture. In the matching subnetwork, we use ResNet [13] for deep feature extraction, followed by a cross-correlation layer. The convolutional layers of the classification subnetwork are identical to the corresponding parts of VGG-M [12], the fully connected layers have 512 output units, and the classification layer outputs 2 scores, as described in MDNet [5].
Offline Training. For the training process of both matching and classification subnetworks, sample pairs are selected from the ImageNet video object detection dataset [25] with random interval. The exemplar and candidate images are picked from the same video. We first load the pre-trained networks to initialize our approach. Then, we apply stochastic gradient descent (SGD) with the learning rate set from to and the momentum of 0.9 to train the networks end-to-end, respectively. More details about the training methods can be found in [6] and [5].
Online Tracking. During tracking, we update only the parameters of the last three layers of the classification subnetwork; all others are fixed. The candidate image is cropped to approximately four times the target size, centered at the previous position. The three thresholds, namely the peak ratio and the update ratios for the similarity and classification scores, are set to 0.75, 0.8 and 0.6, respectively. The number of historical frames is set to 6. Moreover, we use three scales to crop candidate patches at each possible target position.
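Cropping at three scales around one candidate position might look like the following sketch. The scale step is not given in the text, so the value 1.05 is purely a placeholder, and the geometric spacing of the scale factors is an assumption in the spirit of DSST-style scale search; `scaled_boxes` is a hypothetical helper name.

```python
def scaled_boxes(center_x, center_y, width, height,
                 num_scales=3, scale_step=1.05):
    """Return (x, y, w, h) boxes centered on one position at several scales.

    scale_step is a placeholder value; the factors are scale_step ** n for
    n running symmetrically around zero (for three scales: -1, 0, +1).
    """
    boxes = []
    half = (num_scales - 1) / 2
    for n in range(num_scales):
        factor = scale_step ** (n - half)
        w, h = width * factor, height * factor
        boxes.append((center_x - w / 2, center_y - h / 2, w, h))
    return boxes


# Three candidate boxes around a 40x20 target centered at (100, 50).
boxes = scaled_boxes(100.0, 50.0, 40.0, 20.0)
```

Each returned box would then be cropped from the candidate image and resized before being passed to the classification subnetwork.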
Our approach is implemented using MXNet [26] on an Amazon EC2 instance with an Intel Xeon E5 CPU, 61 GB of RAM and an NVIDIA K80 GPU with 12 GB of VRAM. It is worth mentioning that we retrained MDNet [5] on ImageNet [25], since the original MDNet is trained on tracking videos, which may give it an unfair advantage over other tracking approaches.
4.2 Ablation Studies
To verify the effectiveness of the matching and classification subnetworks as well as the update method in our cascaded Siamese network, we conduct ablation studies on the OTB-2015 benchmark. The results are shown in Fig. 3.
All the variants, each implemented using the components indicated in the plot legend, perform worse than our full approach, which shows that each component of our tracking framework helps to improve performance. Note that only our final implementation, denoted by Ours, employs the update method.
4.3 Results on OTB
We show the success rate and precision ranking plots on the OTB-2013 and OTB-2015 benchmarks [23, 10] in Fig. 4. The plots show that the proposed tracker outperforms the other re-detection trackers, MLCFT and CACT, but is less effective than ECO, which exploits continuous convolutional filters.
Overall, our approach attains strong performance in terms of both accuracy and robustness.
4.4 Results on VOT
We also evaluate our proposed approach on the VOT-2016 dataset [24], as shown in Fig. 5. The horizontal grey line indicates the state-of-the-art bound according to the VOT committee. Our tracker ranks second in the overall performance evaluation based on the EAO measure. Specifically, our approach outperforms the CCOT [20] tracker, which achieved the best results in the original VOT-2016 challenge.
SiamFC [6] and MDNet [5] are the baselines of the proposed approach. Compared with them, our tracker not only learns a matching subnetwork to search for possible target positions, but also benefits from the classification subnetwork to determine the optimal tracking result. Moreover, the classification subnetwork update method improves the robustness of the tracker. Therefore, our cascaded Siamese network outperforms both baselines by a large margin.
5 Conclusion
In this paper, we propose a cascaded Siamese network for high performance visual tracking. Our proposed approach consists of a matching subnetwork for similarity learning and a classification subnetwork for determining the optimal tracking result. Extensive experiments on three recent tracking benchmarks demonstrate that the proposed tracker is competitive with a number of state-of-the-art approaches.
6 Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant No. 31701187, the Guangdong Provincial Science and Technology Planning Program under Grant No. 2016B090918047, and Promotional Credit from Amazon Web Service, Inc.
References
- [1] Alper Yilmaz, Omar Javed, and Mubarak Shah, “Object tracking: A survey,” ACM Computing Surveys, vol. 38, no. 4, pp. 13, 2006.
- [2] Arnold WM Smeulders, Dung M Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah, “Visual tracking: An experimental survey,” IEEE TPAMI, vol. 36, no. 7, pp. 1442–1468, 2014.
- [3] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista, “High-speed tracking with kernelized correlation filters,” IEEE TPAMI, vol. 37, no. 3, pp. 583–596, 2015.
- [4] Peng Gao, Yipeng Ma, Chao Li, Ke Song, Yan Zhang, Fei Wang, and Liyi Xiao, “Adaptive object tracking with complementary models,” IEICE Transactions on Information and Systems, vol. E101-D, no. 11, 2018.
- [5] Hyeonseob Nam and Bohyung Han, “Learning multi-domain convolutional neural networks for visual tracking,” in CVPR, 2016.
- [6] Luca Bertinetto, Jack Valmadre, João F Henriques, Andrea Vedaldi, and Philip HS Torr, “Fully-convolutional siamese networks for object tracking,” in ECCV, 2016.
- [7] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg, “ECO: Efficient convolution operators for tracking,” in CVPR, 2017.
- [8] Peng Gao, Yipeng Ma, Ke Song, Chao Li, Fei Wang, Liyi Xiao, and Yan Zhang, “High performance visual tracking with circular and structural operators,” Knowledge-Based Systems, vol. 161, pp. 240–253, 2018.
