DeTrack: In-model Latent Denoising Learning for Visual Object Tracking
Xinyu Zhou, Jinglun Li, Lingyi Hong, Kaixun Jiang, Pinxue Guo, Weifeng Ge, Wenqiang Zhang

TL;DR
DeTrack introduces an in-model latent denoising learning paradigm for visual object tracking, leveraging a denoising Vision Transformer to improve robustness and accuracy without sacrificing real-time performance.
Contribution
The paper proposes a novel in-model latent denoising framework using a denoising Vision Transformer, enabling robust and efficient visual object tracking with a new training paradigm.
Findings
Achieves competitive results on challenging datasets.
Enhances robustness to unseen data through denoising training.
Maintains real-time tracking speed with in-model denoising blocks.
Abstract
Previous visual object tracking methods employ image-feature regression models or coordinate autoregression models for bounding box prediction. Image-feature regression methods depend heavily on matching results and do not utilize positional priors, while the autoregressive approach can only be trained using bounding boxes available in the training set, potentially resulting in suboptimal performance during testing with unseen data. Inspired by the diffusion model, denoising learning enhances the model's robustness to unseen data. Therefore, we introduce noise to bounding boxes, generating noisy boxes for training, thus enhancing model robustness on testing data. We propose a new paradigm to formulate the visual object tracking problem as a denoising learning process. However, tracking algorithms are usually asked to run in real time, directly applying the diffusion model to object…
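The box-noising idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `add_box_noise`, the normalized (cx, cy, w, h) box format, the Gaussian noise model, and the `noise_scale` parameter are all assumptions made for illustration, since this excerpt does not specify how the noise is generated.

```python
import random


def add_box_noise(box, noise_scale=0.1, rng=None):
    """Return a noisy copy of a normalized (cx, cy, w, h) box.

    Hypothetical helper: the Gaussian noise model and its scale are
    assumptions, not details taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    cx, cy, w, h = box

    # Shift the center by a random fraction of the box size.
    noisy_cx = cx + rng.gauss(0.0, noise_scale) * w
    noisy_cy = cy + rng.gauss(0.0, noise_scale) * h

    # Rescale width and height by a random factor around 1.
    noisy_w = w * (1.0 + rng.gauss(0.0, noise_scale))
    noisy_h = h * (1.0 + rng.gauss(0.0, noise_scale))

    def clamp(value, low, high):
        return max(low, min(high, value))

    # Keep the result inside the normalized image and non-degenerate.
    return (
        clamp(noisy_cx, 0.0, 1.0),
        clamp(noisy_cy, 0.0, 1.0),
        clamp(noisy_w, 1e-3, 1.0),
        clamp(noisy_h, 1e-3, 1.0),
    )


# Example: perturb one ground-truth box with a fixed seed.
ground_truth = (0.5, 0.5, 0.2, 0.3)
noisy_box = add_box_noise(ground_truth, noise_scale=0.1, rng=random.Random(0))
```

During training, a model in this style would receive `noisy_box` as input and be supervised to recover `ground_truth`, so it sees box positions beyond those in the annotated training set.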
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Fire Detection and Safety Systems
Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout · Linear Layer · Softmax · Adam · Residual Connection · Vision Transformer
