TL;DR
This paper introduces an attention retrieval network (ARN) and a multi-resolution multi-stage segmentation network (MMS) for pixel-wise object tracking. Together they reduce the influence of background clutter and reach state-of-the-art results on VOT2020 at a real-time 40 fps.
Contribution
The paper proposes an attention retrieval network that applies soft spatial constraints to backbone features, together with a multi-resolution multi-stage segmentation network. Both aim to improve pixel-wise tracking accuracy by mitigating background interference.
Findings
Achieved state-of-the-art performance on the VOT2020 benchmark.
Surpassed SiamMask by a significant margin on multiple datasets.
Operates at 40 fps, enabling real-time tracking.
Abstract
The encoding of the target in object tracking has recently moved from the coarse bounding box to the fine-grained segmentation map. Revisiting de facto real-time approaches that are capable of predicting a mask during tracking, we observed that they usually fork a light branch from the backbone network for segmentation. Although efficient, directly fusing backbone features without considering the negative influence of background clutter tends to introduce false-negative predictions, degrading segmentation accuracy. To mitigate this problem, we propose an attention retrieval network (ARN) to perform soft spatial constraints on backbone features. We first build a look-up table (LUT) with the ground-truth mask in the starting frame, and then retrieve from the LUT to obtain an attention map for spatial constraints. Moreover, we introduce a multi-resolution multi-stage segmentation network (MMS) to…
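The look-up-table idea in the abstract can be illustrated with a small sketch. This is not the paper's ARN: it assumes the LUT stores one normalized feature vector per first-frame pixel as a key and that pixel's ground-truth mask value as the entry, and that retrieval is a softmax-weighted lookup by cosine similarity. The function names, the temperature value, and the toy features are all hypothetical.

```python
import numpy as np

def build_lut(feat0, mask0):
    """Hypothetical LUT: per-pixel unit-norm features (keys) paired with mask values."""
    c = feat0.shape[0]
    keys = feat0.reshape(c, -1).T                      # (H*W, C)
    keys = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    values = mask0.reshape(-1).astype(np.float64)      # (H*W,)
    return keys, values

def retrieve_attention(lut, feat, temperature=0.05):
    """Look up each pixel of a later frame in the LUT; returns an (H, W) map in [0, 1]."""
    keys, values = lut
    c, h, w = feat.shape
    queries = feat.reshape(c, -1).T
    queries = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-8)
    sim = queries @ keys.T / temperature               # scaled cosine similarity
    sim -= sim.max(axis=1, keepdims=True)              # numerically stable softmax
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ values).reshape(h, w)            # weighted average of mask values

def apply_soft_constraint(feat, attn):
    """Soft spatial constraint: scale every channel by the attention map."""
    return feat * attn[None, :, :]

# Toy data (assumed): target pixels carry feature [1, 0], background [0, 1].
mask0 = np.zeros((4, 4)); mask0[1:3, 1:3] = 1          # first-frame ground-truth mask
feat0 = np.stack([mask0, 1 - mask0])                   # (2, 4, 4) first-frame features
mask1 = np.zeros((4, 4)); mask1[0:2, 0:2] = 1          # target has moved in a later frame
feat1 = np.stack([mask1, 1 - mask1])

lut = build_lut(feat0, mask0)
attn = retrieve_attention(lut, feat1)
constrained = apply_soft_constraint(feat1, attn)
```

In this toy setting the attention map follows the target to its new location, so background features are suppressed before any segmentation branch would see them. A real tracker would use learned backbone features rather than hand-built ones.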
