Fast Online Object Tracking and Segmentation: A Unifying Approach
Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, Philip H.S. Torr

TL;DR
SiamMask is a real-time, unified approach to visual object tracking and semi-supervised video object segmentation. It augments fully-convolutional Siamese trackers with a binary segmentation task and sets a new state of the art among real-time trackers on VOT-2018.
Contribution
The paper introduces SiamMask, a simple method that unifies object tracking and segmentation in real time. It improves the offline training of fully-convolutional Siamese trackers by adding a binary segmentation task to their loss.
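To make the idea of augmenting a tracking loss with a binary segmentation task concrete, here is a minimal sketch in plain Python. It assumes a per-pixel logistic loss with labels of +1 (object) and -1 (background). The function names and the `mask_weight` parameter are hypothetical and chosen for illustration; this summary does not state the exact loss or weighting used by the authors.

```python
import math


def binary_mask_loss(logits, labels):
    """Mean per-pixel logistic loss for a binary mask.

    logits: 2D list of raw mask scores (hypothetical network output).
    labels: 2D list of +1 (object) or -1 (background), same shape.
    """
    total = 0.0
    count = 0
    for logit_row, label_row in zip(logits, labels):
        for score, label in zip(logit_row, label_row):
            z = -label * score
            # Numerically stable log(1 + exp(z)).
            total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


def combined_loss(tracking_loss, mask_loss, mask_weight=1.0):
    """Add a weighted segmentation term to an existing tracking loss.

    mask_weight is a placeholder value, not a setting from the paper.
    """
    return tracking_loss + mask_weight * mask_loss


if __name__ == "__main__":
    logits = [[2.0, -1.0], [0.5, -3.0]]
    labels = [[1, -1], [1, -1]]
    seg = binary_mask_loss(logits, labels)
    print(combined_loss(tracking_loss=0.8, mask_loss=seg, mask_weight=2.0))
```

The segmentation term only changes training. At inference the same network can output a mask alongside its usual tracking response.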
Findings
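Runs online at 55 frames per second, producing class-agnostic segmentation masks and rotated bounding boxes from a single bounding-box initialisation.
Sets a new state of the art among real-time trackers on VOT-2018.
Demonstrates competitive accuracy and the best speed for semi-supervised video object segmentation on DAVIS-2016 and DAVIS-2017.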
Abstract
In this paper we illustrate how to perform both visual object tracking and semi-supervised video object segmentation, in real-time, with a single simple approach. Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. Once trained, SiamMask solely relies on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at 55 frames per second. Despite its simplicity, versatility and fast speed, our strategy allows us to establish a new state of the art among real-time trackers on VOT-2018, while at the same time demonstrating competitive performance and the best speed for the semi-supervised video object segmentation task on DAVIS-2016 and DAVIS-2017. The project website is…
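The abstract says SiamMask outputs rotated bounding boxes along with masks, but it does not say how a box is derived from a mask. One common option is the minimum-area rectangle enclosing the mask pixels, sketched below in plain Python. This is an assumption for illustration, not a description of the authors' procedure, and the function names are hypothetical. Box sizes are measured between pixel centres.

```python
import math


def mask_points(mask):
    """Return (x, y) coordinates of all non-zero pixels in a 2D list."""
    return [(float(x), float(y))
            for y, row in enumerate(mask)
            for x, value in enumerate(row) if value]


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices without repeats."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect(mask):
    """Return ((cx, cy), width, height, angle_radians) of the smallest
    rectangle enclosing the mask, trying each hull edge as a side."""
    hull = convex_hull(mask_points(mask))
    if not hull:
        raise ValueError("mask has no foreground pixels")
    best = None
    n = len(hull)
    for i in range(n):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % n]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(theta), math.sin(theta)
        # Rotate hull points by -theta so this edge lies along the x axis.
        us = [x * c + y * s for x, y in hull]
        vs = [-x * s + y * c for x, y in hull]
        width, height = max(us) - min(us), max(vs) - min(vs)
        if best is None or width * height < best[0]:
            cu, cv = (max(us) + min(us)) / 2, (max(vs) + min(vs)) / 2
            # Rotate the box centre back into image coordinates.
            centre = (cu * c - cv * s, cu * s + cv * c)
            best = (width * height, centre, width, height, theta)
    return best[1:]


if __name__ == "__main__":
    mask = [[0, 0, 1, 0],
            [0, 1, 1, 1],
            [1, 1, 1, 0],
            [0, 1, 0, 0]]
    print(min_area_rect(mask))
```

Checking every hull edge is quadratic in the number of hull vertices, which stays small for typical object masks.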
Taxonomy
Topics: Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
