Deeper and Wider Siamese Networks for Real-Time Visual Tracking

Zhipeng Zhang; Houwen Peng

arXiv:1901.01660·cs.CV·March 29, 2019·110 cites



TL;DR

This paper introduces deeper and wider Siamese network architectures with residual modules to improve real-time visual tracking accuracy and robustness, addressing issues caused by increased receptive fields and padding effects.

Contribution

The authors propose new residual modules and architectures that control receptive field size and mitigate padding bias, enhancing Siamese network performance in visual tracking.
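The padding bias mentioned above can be illustrated with a toy example. The sketch below is not the authors' code: it is a minimal 1-D cross-correlation written for illustration (the helper name `conv1d` and the toy signal are assumptions), showing that zero padding makes the response to a perfectly uniform input depend on position, while an unpadded convolution does not.

```python
def conv1d(signal, kernel, pad=0):
    """Cross-correlate a 1-D signal with a kernel at stride 1.

    `pad` zeros are added on each side of the signal before filtering.
    """
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    k = len(kernel)
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]


# A uniform input: every position looks identical, so an unbiased
# filter should respond identically everywhere.
signal = [1.0] * 8
kernel = [1.0, 1.0, 1.0]

padded_out = conv1d(signal, kernel, pad=1)    # zero-padded, same length as input
unpadded_out = conv1d(signal, kernel, pad=0)  # no padding, shorter output

print(padded_out)    # [2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0]
print(unpadded_out)  # [3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
```

The border responses of the padded output differ from the interior ones even though the input is uniform, which is the kind of position-dependent signal a network can latch onto during training.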

Findings

1. Up to 9.8% relative improvement in AUC on OTB-15
2. Up to 23.3% relative improvement in EAO on VOT-16
3. Real-time tracking speed with improved accuracy

Abstract

Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet [18], which does not fully take advantage of the capability of modern deep neural networks. In this paper, we investigate how to leverage deeper and wider convolutional neural networks to enhance tracking robustness and accuracy. We observe that direct replacement of backbones with existing powerful architectures, such as ResNet [14] and Inception [33], does not bring improvements. The main reasons are that 1) large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision; and 2) the network padding for convolutions induces a positional bias in learning. To address these issues, we propose new residual modules to eliminate the…
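To make the receptive-field argument in the abstract concrete, the sketch below computes the receptive field and total stride of a chain of convolution and pooling layers using the standard recurrence (each layer adds `(kernel - 1)` times the accumulated stride). The layer lists are illustrative assumptions, not configurations taken from the paper: `alexnet_like` approximates the shallow AlexNet-style backbone commonly used in Siamese trackers, and `deeper` is a hypothetical extension with one more downsampling stage.

```python
def receptive_field(layers):
    """Return (receptive_field, total_stride) for a chain of layers.

    `layers` is a sequence of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


# Assumed AlexNet-style stack: conv1, pool1, conv2, pool2, conv3, conv4, conv5.
alexnet_like = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]

# Hypothetical deeper variant: one extra stride-2 layer plus two 3x3 layers.
deeper = alexnet_like + [(3, 2), (3, 1), (3, 1)]

print(receptive_field(alexnet_like))  # (87, 8)
print(receptive_field(deeper))        # (167, 16)
```

Under these assumptions, three extra layers nearly double the receptive field and double the stride, which shows how quickly naive deepening changes both quantities.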


Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Vision and Imaging · Image Enhancement Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · 1x1 Convolution · Convolution · Local Response Normalization · Grouped Convolution · Dropout · Dense Connections · Max Pooling · Softmax