# Siamese Attentional Keypoint Network for High Performance Visual Tracking

**Authors:** Peng Gao, Ruyue Yuan, Fei Wang, Liyi Xiao, Hamido Fujita, Yan Zhang

arXiv: 1904.10128 · 2020-01-01

## TL;DR

This paper introduces SATIN, a novel Siamese attentional keypoint network that enhances visual tracking accuracy and efficiency by combining a lightweight hourglass backbone, cross-attention modules, and keypoint detection for precise object localization.

## Contribution

The paper proposes what its authors describe as, to the best of their knowledge, the first approach to combine a Siamese hourglass network, cross-attention, and keypoint detection for high-performance visual tracking.

## Key findings

- Achieves state-of-the-art results on several recent benchmark datasets.
- Runs at over 27 frames per second.
- Improves both the localization and discriminative capabilities of the feature maps through the cross-attentional module.
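
The abstract attributes the localization and discriminative gains to attention applied both channel-wise and spatially. The sketch below illustrates that general idea only: it is a parameter-free gating scheme chosen for illustration, not the paper's cross-attentional module, which is learned and whose exact design is not given in this summary. The function name `channel_spatial_attention` and the use of plain averages followed by a sigmoid are assumptions.

```python
import math


def sigmoid(v):
    """Logistic function, used here to squash pooled values into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def channel_spatial_attention(features):
    """Reweight a feature map with a channel gate and a spatial gate.

    features: list of C channels, each an H x W list of floats.
    Hypothetical, parameter-free stand-in for a learned attention module.
    """
    c, h, w = len(features), len(features[0]), len(features[0][0])

    # Channel gate: one weight per channel from its global average.
    channel_gate = [sigmoid(sum(map(sum, ch)) / (h * w)) for ch in features]

    # Spatial gate: one weight per location from the mean over channels.
    spatial_gate = [
        [sigmoid(sum(features[k][y][x] for k in range(c)) / c) for x in range(w)]
        for y in range(h)
    ]

    # Apply both gates multiplicatively; the output keeps the input shape.
    return [
        [
            [features[k][y][x] * channel_gate[k] * spatial_gate[y][x] for x in range(w)]
            for y in range(h)
        ]
        for k in range(c)
    ]


# Usage: a 2-channel, 1 x 2 feature map.
out = channel_spatial_attention([[[0.0, 2.0]], [[0.0, 0.0]]])
```

Because both gates lie strictly between 0 and 1, this toy version can only attenuate activations; a learned module would typically add trainable layers before each gate.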

## Abstract

In this paper, we investigate the impacts of three main aspects of visual tracking, i.e., the backbone network, the attentional mechanism, and the detection component, and propose a Siamese Attentional Keypoint Network, dubbed SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese lightweight hourglass network is specially designed for visual tracking. It takes advantage of repeated bottom-up and top-down inference to capture more global and local contextual information at multiple scales. Secondly, a novel cross-attentional module is utilized to leverage both channel-wise and spatial intermediate attentional information, which can enhance both the discriminative and localization capabilities of feature maps. Thirdly, a keypoint detection approach is introduced to track any target object by detecting the top-left corner point, the centroid point, and the bottom-right corner point of its bounding box. Therefore, our SATIN tracker not only has a strong capability to learn more effective object representations, but is also computationally and memory efficient during both the training and testing stages. To the best of our knowledge, we are the first to propose this approach. Without bells and whistles, experimental results demonstrate that our approach achieves state-of-the-art performance on several recent benchmark datasets, at a speed far exceeding 27 frames per second.
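
The keypoint formulation in the abstract (top-left corner, centroid, bottom-right corner) can be made concrete with a minimal decoding sketch. It assumes the network outputs one score heatmap per keypoint, takes the peak of each, and uses the centroid only as a consistency check on the two corners. The names `decode_box`, `peak_map`, `stride`, and `tol` are hypothetical, and the paper's actual decoding (for example any offset regression or corner grouping) is not described in this summary.

```python
def argmax_2d(heatmap):
    """Return (x, y, score) of the highest-scoring cell in a 2D list."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best[2]:
                best = (x, y, score)
    return best


def decode_box(tl_map, ctr_map, br_map, stride=1, tol=1.0):
    """Turn three keypoint heatmaps into a bounding box.

    Returns ((x1, y1, x2, y2), consistent) in input-image pixels, or None
    if the corners do not form a valid box. `stride` is the assumed
    downsampling factor of the heatmaps; `tol` is the allowed distance (in
    heatmap cells) between the predicted centroid and the corner midpoint.
    """
    x1, y1, _ = argmax_2d(tl_map)
    cx, cy, _ = argmax_2d(ctr_map)
    x2, y2, _ = argmax_2d(br_map)

    # The top-left corner must lie strictly above and left of the bottom-right.
    if x2 <= x1 or y2 <= y1:
        return None

    # The centroid should sit near the midpoint of the two corners.
    mid_x, mid_y = (x1 + x2) / 2, (y1 + y2) / 2
    consistent = abs(mid_x - cx) <= tol and abs(mid_y - cy) <= tol

    box = (x1 * stride, y1 * stride, x2 * stride, y2 * stride)
    return box, consistent


def peak_map(size, x, y):
    """Build a size x size heatmap with a single peak at (x, y)."""
    m = [[0.0] * size for _ in range(size)]
    m[y][x] = 1.0
    return m


# Usage: corners at (1, 1) and (3, 3), centroid at (2, 2), stride 4.
print(decode_box(peak_map(5, 1, 1), peak_map(5, 2, 2), peak_map(5, 3, 3), stride=4))
# prints ((4, 4, 12, 12), True)
```

Using the centroid as a check rather than as a third box parameter is one possible design; it lets a tracker flag frames where the corner peaks belong to a distractor instead of the target.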

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.10128/full.md

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/1904.10128/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/1904.10128/full.md

---
Source: https://tomesphere.com/paper/1904.10128