ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking

Yushan Han; Kaer Huang

arXiv:2403.07914 · cs.CV · March 14, 2024 · 1 citation

Open Access

TL;DR

ACTrack adds a lightweight, trainable spatio-temporal network to a frozen pre-trained Transformer backbone for visual object tracking, preserving the backbone's capabilities while reducing training cost and improving tracking performance.

Contribution

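The paper proposes an additive spatio-temporal framework in which a lightweight trainable network models spatio-temporal relations while the pre-trained Transformer backbone stays frozen, so the backbone's quality is preserved and training is more efficient.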

Findings

01. Balances training efficiency and tracking performance.

02. Outperforms existing methods on multiple benchmarks.

03. Preserves pre-trained Transformer backbone capabilities.

Abstract

Efficiently modeling spatio-temporal relations of objects is a key challenge in visual object tracking (VOT). Existing methods track by appearance-based similarity or long-term relation modeling, so the rich temporal context between consecutive frames is easily overlooked. Moreover, training trackers from scratch or fine-tuning large pre-trained models is costly in time and memory. In this paper, we present ACTrack, a new tracking framework with additive spatio-temporal conditions. It preserves the quality and capabilities of the pre-trained Transformer backbone by freezing its parameters, and adds a trainable lightweight additive net to model spatio-temporal relations in tracking. We design an additive siamese convolutional network to ensure the integrity of spatial features and perform temporal sequence modeling to simplify the tracking pipeline. Experimental…
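The idea of a frozen backbone plus a trainable additive branch can be illustrated with a toy sketch. This is not the authors' implementation: the linear maps stand in for the Transformer backbone and the additive net, the conditioning vector `cond`, the class names, the squared-error loss, and the zero initialisation of the additive weights are all assumptions made for illustration. It only shows that the branch output is added to the backbone features and that gradient updates touch the branch alone.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class FrozenBackbone:
    """Hypothetical stand-in for a pre-trained backbone; weights are never updated."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, frame):
        return matvec(self.W, frame)


class AdditiveNet:
    """Hypothetical small trainable branch whose output is added to backbone features."""

    def __init__(self, dim):
        # Assumption: zero init, so the combined output starts equal to the backbone output.
        self.V = [[0.0] * dim for _ in range(dim)]

    def __call__(self, cond):
        return matvec(self.V, cond)

    def sgd_step(self, cond, grad_out, lr):
        # For out = V @ cond, dLoss/dV[i][j] = grad_out[i] * cond[j].
        for i, g in enumerate(grad_out):
            for j, c in enumerate(cond):
                self.V[i][j] -= lr * g * c


def forward(backbone, additive, frame, cond):
    """Backbone features plus the additive branch's correction."""
    base = backbone(frame)
    delta = additive(cond)
    return [b + d for b, d in zip(base, delta)]


def train_step(backbone, additive, frame, cond, target, lr=0.1):
    """One SGD step on 0.5 * ||out - target||^2; only the additive net is updated."""
    out = forward(backbone, additive, frame, cond)
    err = [o - t for o, t in zip(out, target)]
    loss = 0.5 * sum(e * e for e in err)
    additive.sgd_step(cond, err, lr)  # the backbone receives no update
    return loss
```

Calling `train_step` repeatedly with a fixed `frame`, `cond`, and `target` lowers the loss while `backbone.W` stays unchanged, which is the property the abstract attributes to freezing the backbone.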

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Linear Layer · Multi-Head Attention