Self-Supervised RGB-T Tracking with Cross-Input Consistency
Xingchen Zhang, Yiannis Demiris

TL;DR
This paper introduces what the authors describe as the first self-supervised RGB-T tracking method. It learns from unlabeled RGB-T video pairs and uses cross-input consistency to achieve competitive performance without annotated data.
Contribution
It presents a novel self-supervised training strategy for RGB-T tracking based on cross-input consistency, eliminating the need for labeled RGB-T datasets.
Findings
Outperforms seven supervised RGB-T trackers on the GTOT dataset
Trains effectively using only unlabeled RGB-T video pairs
Demonstrates the viability of self-supervised learning for RGB-T tracking
Abstract
In this paper, we propose a self-supervised RGB-T tracking method. Existing deep RGB-T trackers are trained on a large number of annotated RGB-T image pairs. In contrast, our RGB-T tracker is trained on unlabeled RGB-T video pairs in a self-supervised manner. We propose a novel self-supervised training strategy based on cross-input consistency, motivated by the idea that tracking can be performed using different inputs. Specifically, we construct two distinct inputs from unlabeled RGB-T video pairs. We then track objects with each of these two inputs and use the resulting tracking outputs to construct our cross-input consistency loss. In addition, we propose a reweighting strategy that makes our loss function robust to low-quality training samples. We build our tracker on a Siamese correlation filter network. To the best of our knowledge, our tracker is the first self-supervised RGB-T tracker.…
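The abstract describes the pipeline only at a high level. The following NumPy sketch is therefore an illustration built on assumptions, not the authors' implementation. It shows two pieces:

(1) A closed-form multi-channel correlation filter of the kind used in DCFNet-style Siamese correlation filter trackers, solved by ridge regression in the Fourier domain.

(2) A hypothetical reweighted consistency loss between the response maps produced by two different inputs.

All of the following are illustrative assumptions:
- The names `gaussian_label`, `cf_response` and `consistency_loss`.
- The use of RGB-only and thermal-only features as the "two inputs".
- The random arrays standing in for CNN features.
- The PSR-like confidence score used for reweighting.

The paper's actual input construction and reweighting scheme are not given in this summary.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    # Desired correlation output: a 2-D Gaussian peaked at the patch centre.
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def cf_response(z, x, y, lam=1e-4):
    """Closed-form multi-channel correlation filter (DCF-style sketch).

    z: template features (C, H, W); x: search features (C, H, W);
    y: desired response (H, W). The filter is learned from z by ridge
    regression in the Fourier domain and then applied to x.
    """
    Z, X, Y = np.fft.fft2(z), np.fft.fft2(x), np.fft.fft2(y)  # FFT over last two axes
    # Per-frequency, per-channel solution: A_l = Y * conj(Z_l) / (sum_k |Z_k|^2 + lam)
    A = Y * np.conj(Z) / (np.sum(np.abs(Z) ** 2, axis=0) + lam)
    # Apply the filter to the search features and sum over channels.
    return np.real(np.fft.ifft2(np.sum(A * X, axis=0)))

def consistency_loss(r_a, r_b, eps=1e-8):
    """Hypothetical reweighted cross-input consistency loss.

    r_a, r_b: response maps (N, H, W) obtained by tracking the same targets
    with two different inputs. Each sample's squared disagreement is weighted
    by a PSR-like sharpness score, so flat (unreliable) responses count less.
    """
    def sharpness(r):
        flat = r.reshape(len(r), -1)
        return (flat.max(axis=1) - flat.mean(axis=1)) / (flat.std(axis=1) + eps)

    conf = np.minimum(sharpness(r_a), sharpness(r_b))  # trust the weaker input
    weights = conf / (conf.sum() + eps)                 # normalise over the batch
    per_sample = np.mean((r_a - r_b) ** 2, axis=(1, 2))
    return float(np.sum(weights * per_sample))

rng = np.random.default_rng(0)
H = W = 32
y = gaussian_label(H, W)
# Hypothetical stand-ins for RGB and thermal feature maps (4 channels each).
z_rgb, z_t = rng.standard_normal((2, 4, H, W))
# The target moves by (+3, -5) pixels between template and search frame.
shift = (3, -5)
x_rgb = np.roll(z_rgb, shift, axis=(1, 2))
x_t = np.roll(z_t, shift, axis=(1, 2))
r_a = cf_response(z_rgb, x_rgb, y)[None]  # response from input A
r_b = cf_response(z_t, x_t, y)[None]      # response from input B
peak = np.unravel_index(np.argmax(r_a[0]), (H, W))
print(peak, consistency_loss(r_a, r_b))
```

In this toy setup, both inputs see the same target motion. Their responses therefore peak at the Gaussian centre plus the (3, -5) shift, and the consistency loss is close to zero. During training, disagreement between the two responses is what such a loss would penalise, and samples with flat responses contribute less.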
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
