TL;DR
This paper introduces AttU-Net, an unsupervised multi-object tracking (MOT) model that stays accurate on noisy video by learning visual representations at multiple segment scales. It outperforms variational inference-based baselines on the MNIST-MOT and Atari benchmarks.
Contribution
The paper presents AttU-Net, a novel single-head attention model for unsupervised MOT that is resilient to noise and improves upon variational inference-based baselines.
Findings
AttU-Net outperforms state-of-the-art baselines in noisy conditions.
The model maintains high tracking accuracy on the extended video datasets introduced with the paper, including Kuzushiji-MNIST MOT.
Robustness is validated on MNIST-MOT, Atari, and extended datasets.
Abstract
Physical processes, camera movement, and unpredictable environmental conditions such as dust can induce noise and artifacts in video feeds. We observe that popular unsupervised multi-object tracking (MOT) methods depend on noise-free inputs. We show that adding a small amount of artificial random noise causes a sharp degradation in model performance on benchmark metrics. We resolve this problem by introducing a robust unsupervised MOT model: AttU-Net. The proposed single-head attention model helps limit the negative impact of noise by learning visual representations at different segment scales. AttU-Net achieves better unsupervised MOT performance than variational inference-based state-of-the-art baselines. We evaluate our method on the MNIST-MOT and Atari game video benchmarks. We also provide two extended video datasets: "Kuzushiji-MNIST MOT"…
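The noise perturbation described in the abstract can be illustrated with a minimal sketch. The abstract does not state which noise distribution or magnitude the authors used, so the zero-mean Gaussian noise, the `sigma` value, the `[0, 1]` pixel range, and the function name `add_gaussian_noise` below are all assumptions made for illustration, not the paper's protocol.

```python
import numpy as np


def add_gaussian_noise(frames, sigma=0.1, seed=0):
    """Return a noisy copy of a video.

    frames: float array with pixel values in [0, 1], shape (T, H, W) or (T, H, W, C).
    sigma:  standard deviation of the additive noise (assumed value, not from the paper).
    seed:   seed for the random generator, so the perturbation is reproducible.
    """
    rng = np.random.default_rng(seed)
    # Additive zero-mean Gaussian noise, drawn independently per pixel and per frame.
    noisy = frames + rng.normal(0.0, sigma, size=frames.shape)
    # Clip so the result remains a valid image in [0, 1].
    return np.clip(noisy, 0.0, 1.0)


# Toy video: 4 frames of 8x8 pixels with one bright square "object".
video = np.zeros((4, 8, 8), dtype=np.float64)
video[:, 2:5, 2:5] = 1.0

noisy_video = add_gaussian_noise(video, sigma=0.1)
```

Feeding both `video` and `noisy_video` to a tracker and comparing the benchmark metrics is one way to probe the kind of degradation the abstract reports.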
