Tracking through Containers and Occluders in the Wild
Basile Van Hoorick, Pavel Tokmakov, Simon Stent, Jie Li, Carl Vondrick

TL;DR
This paper introduces TCOW, a benchmark and model for tracking objects through heavy occlusion and containment in cluttered environments, highlighting current model limitations in understanding object permanence.
Contribution
The paper presents a new benchmark and dataset for tracking through occlusion and containment, along with an evaluation of transformer-based models on this challenging task.
Findings
The two transformer-based video models evaluated track targets surprisingly well under some settings of task variation, but not all.
A considerable performance gap remains in understanding object permanence.
The dataset supports both supervised learning and structured evaluation.
Abstract
Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection · Human Pose and Action Recognition
