Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking

Junze Shi; Yang Yu; Jian Shi; Haibo Luo

arXiv:2601.09078·cs.CV·January 15, 2026

TL;DR

This paper introduces STDTrack, a lightweight visual tracker that leverages dense spatiotemporal sampling and novel modules to improve accuracy while maintaining real-time speed, bridging the gap with high-performance trackers.

Contribution

The paper proposes a novel framework integrating reliable spatiotemporal dependencies into lightweight trackers using dense sampling and new modules for enhanced performance.

Findings

01

Achieves state-of-the-art results on six benchmarks.

02

Operates at 192 FPS on GPU and 41 FPS on CPU.

03

Rivals high-performance non-real-time trackers like MixFormer.

Abstract

Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training (only one template and one search image per sequence), which fails to comprehensively explore the spatiotemporal information in videos. This limitation constrains performance and causes the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating…
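
To make the contrast between sparse and dense sampling concrete, the following is a minimal sketch and not the authors' implementation. The function names, the `clip_len` parameter, and the choice of consecutive search frames are all assumptions made for illustration; the abstract only states that sparse sampling uses one template and one search image per sequence, and that STDTrack uses dense video sampling.

```python
import random


def sample_sparse(num_frames, rng):
    """Sparse scheme described in the abstract: one template frame and
    one search frame drawn from a sequence of num_frames frames."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    # Draw two distinct frame indices; the earlier one acts as the template.
    template, search = sorted(rng.sample(range(num_frames), 2))
    return template, [search]


def sample_dense(num_frames, clip_len, rng):
    """Hypothetical dense scheme (an assumption, not taken from the paper):
    one template frame followed by clip_len consecutive search frames."""
    if not 1 <= clip_len < num_frames:
        raise ValueError("clip_len must satisfy 1 <= clip_len < num_frames")
    # Pick the template so that clip_len following frames still fit.
    template = rng.randrange(num_frames - clip_len)
    search = list(range(template + 1, template + 1 + clip_len))
    return template, search


if __name__ == "__main__":
    rng = random.Random(0)
    print("sparse:", sample_sparse(100, rng))
    print("dense: ", sample_dense(100, 8, rng))
```

In this sketch the sparse sampler yields a single template/search pair per sequence, while the dense sampler yields a run of search frames, so a model trained on its output would see how the target changes from frame to frame. How STDTrack actually selects and orders frames is not specified in the text above.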

Taxonomy

Topics: Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection · Human Pose and Action Recognition