Improving Siamese Based Trackers with Light or No Training through Multiple Templates and Temporal Network
Ali Sekhavati, Won-Sook Lee

TL;DR
This paper introduces a framework for Siamese-based trackers that needs light or no training. It improves performance by using multiple adaptive templates, added without retraining the tracker, together with a lightweight temporal network, and it applies across datasets and tracker architectures.
Contribution
It combines adaptive template updating, which removes the need to retrain the tracker, with a lightweight temporal network that can be used independently of the tracker, improving tracking accuracy.
Findings
Improved performance on multiple datasets including LaSOT and TrackingNet.
Effective with both convolutional and transformer-based trackers.
Achieved better robustness to target appearance changes.
Abstract
High computational power and significant time are usually needed to train a deep-learning-based tracker on large datasets. Depending on many factors, training might not always be an option. In this paper, we propose a framework with two ideas for Siamese-based trackers: (i) extending the number of templates in a way that removes the need to retrain the network, and (ii) a lightweight temporal network with a novel architecture focusing on both local and global information that can be used independently of trackers. Most Siamese-based trackers rely only on the first frame as the ground truth for objects and struggle when the target's appearance changes significantly in subsequent frames in the presence of similar distractors. Some trackers use multiple templates which mostly rely on constant thresholds to update, or they replace those templates that have low similarity scores only with more…
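To make the constant-threshold update mentioned in the abstract concrete, here is a minimal sketch of that baseline idea. It is not the paper's method, whose details are cut off above. The class name `TemplateBank`, the `capacity` and `threshold` values, the use of cosine similarity on plain feature vectors, and the replacement rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence
import math


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class TemplateBank:
    """Hypothetical multi-template store with a constant update threshold."""
    first: List[float]          # first-frame template, never replaced
    capacity: int = 3           # assumed number of extra templates
    threshold: float = 0.8      # assumed constant update threshold
    extras: List[List[float]] = field(default_factory=list)

    def score(self, candidate: Sequence[float]) -> float:
        """Best similarity of the candidate against any stored template."""
        return max(cosine(candidate, t) for t in [self.first, *self.extras])

    def update(self, candidate: Sequence[float]) -> bool:
        """Store the candidate if it matches confidently; return True if stored."""
        if self.score(candidate) < self.threshold:
            return False  # low confidence: possibly a distractor or occlusion
        if len(self.extras) < self.capacity:
            self.extras.append(list(candidate))
        else:
            # Assumed rule: replace the extra least similar to the first frame.
            worst = min(range(len(self.extras)),
                        key=lambda i: cosine(self.first, self.extras[i]))
            self.extras[worst] = list(candidate)
        return True
```

In a real tracker the vectors would be template features from the Siamese backbone and the score would come from the tracker's response map. The fixed `threshold` is exactly the kind of constant that the abstract says most multi-template trackers rely on.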
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Visual Attention and Saliency Detection
Methods: Siamese Network · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
