TL;DR
This paper introduces a scalable unsupervised object tracking architecture that leverages spatial invariance, outperforming existing methods in cluttered scenes and generalizing well to larger and more complex videos.
Contribution
It presents a novel architecture using spatially invariant computations to improve unsupervised object tracking in large, cluttered scenes, addressing scalability and generalization issues.
Findings
Outperforms competing methods in cluttered scenes
Generalizes well to larger and more complex videos
Effective in tracking many objects without supervision
Abstract
The ability to detect and track objects in the visual world is a crucial skill for any intelligent agent, as it is a necessary precursor to any object-level reasoning process. Moreover, it is important that agents learn to track objects without supervision (i.e. without access to annotated training videos) since this will allow agents to begin operating in new environments with minimal human assistance. The task of learning to discover and track objects in videos, which we call unsupervised object tracking, has grown in prominence in recent years; however, most architectures that address it still struggle to deal with large scenes containing many objects. In the current work, we propose an architecture that scales well to the large-scene, many-object setting by employing spatially invariant computations (convolutions and spatial attention) and representations (a spatially local…
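To make the phrase "spatially invariant computations" concrete, here is a minimal sketch of why a convolution transfers from small scenes to large ones. It is not the paper's architecture: the filter, the scene sizes, and the helper name `conv2d_valid` are all assumptions chosen for illustration. The same 3x3 filter is applied to an 8x8 scene and to a 32x32 scene that contains the small scene pasted at an offset, and the responses over the pasted region are compared.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding (hypothetical helper)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Every output location uses the same weights on a local window.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))   # stand-in for a learned filter (assumption)

small = rng.normal(size=(8, 8))    # a "small scene"
large = np.zeros((32, 32))         # a "large scene", 16x the area
large[5:13, 17:25] = small         # paste the small scene at offset (5, 17)

small_out = conv2d_valid(small, kernel)   # shape (6, 6)
large_out = conv2d_valid(large, kernel)   # shape (30, 30)

# Windows that lie fully inside the pasted region see identical pixels,
# so the filter responds identically regardless of where the content sits.
patch_matches = np.allclose(large_out[5:11, 17:23], small_out)
print(small_out.shape, large_out.shape, patch_matches)
```

Because the filter has no dependence on absolute position or on the input size, the weights can be reused unchanged on the larger scene. That is the property the abstract appeals to when it describes scaling to large scenes with many objects.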
