CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features
X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang

TL;DR
CSTrack is an RGB-X tracker that models compact spatiotemporal features, which simplifies the architecture, reduces computational overhead, and achieves state-of-the-art results on RGB-X benchmarks.
Contribution
The paper proposes a new tracker with a Spatial Compact Module and a Temporal Compact Module for efficient and effective spatiotemporal feature modeling in RGB-X tracking.
Findings
Achieves new state-of-the-art results on RGB-X benchmarks.
Reduces computational complexity compared to existing methods.
Effectively models intra- and inter-modality spatial and temporal features.
Abstract
Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a…
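The overhead argument in the abstract can be illustrated with simple attention arithmetic. The sketch below is not taken from the paper: it assumes self-attention whose cost scales with the number of query-key pairs, and it assumes the merged stream keeps the token count of a single modality. The abstract is cut off before it states what the Spatial Compact Module outputs, so that second assumption is purely illustrative. The function names and the token count are hypothetical.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs in one self-attention layer over num_tokens tokens."""
    return num_tokens * num_tokens


def two_branch_pairs(tokens_per_modality: int) -> int:
    """Intra-modality attention run separately on the RGB and X streams."""
    return 2 * attention_pairs(tokens_per_modality)


def compact_pairs(tokens_per_modality: int) -> int:
    """Attention over one merged stream.

    ASSUMPTION (not from the source): the merged stream has as many
    tokens as a single modality.
    """
    return attention_pairs(tokens_per_modality)


n = 256  # hypothetical number of tokens per modality
print(two_branch_pairs(n))  # 131072
print(compact_pairs(n))     # 65536
```

Under these assumptions, the single-stream layout halves the pairwise attention work per layer. The actual savings reported for CSTrack depend on its real architecture and are not reproduced here.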
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Vision and Imaging · Visual Attention and Saliency Detection
