Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation

Finlay G. C. Hudson; William A. P. Smith

arXiv:2411.19210 · cs.CV · March 6, 2026

TL;DR

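This paper introduces TABE, a zero-shot amodal video object segmentation pipeline. Given a single query mask from the first frame in which the object is visible, it uses a pretrained video diffusion model, fine-tuned only at test time on the tracked object, to segment the object's full extent through occlusion without class-specific training.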

Contribution

The novel TABE pipeline enables zero-shot amodal segmentation using a pretrained diffusion model and a single initial mask, eliminating the need for class-specific training.

Findings

01. Effective amodal segmentation even with full occlusion
02. No re-training needed for new objects or classes
03. Outperforms existing methods in zero-shot scenarios

Abstract

We present Track Anything Behind Everything (TABE), a novel pipeline for zero-shot amodal video object segmentation. Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible, enabling flexible, zero-shot inference. We pose amodal segmentation as generative outpainting from modal (visible) masks using a pretrained video diffusion model. We do not need to re-train the diffusion model to accommodate additional input channels but instead use a pretrained model that we fine-tune at test-time to allow specialisation towards the tracked object. Our TABE pipeline is specifically designed to handle amodal completion, even in scenarios where objects are completely occluded. Our model and code will all be released.
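
To make the distinction between modal (visible) and amodal (full-extent) masks concrete, the following is a minimal NumPy sketch on a toy frame. It is not the TABE pipeline and does not involve a diffusion model; the helper names `occluded_region` and `mask_iou`, the 6x6 frame, and the occluder layout are all assumptions chosen for illustration.

```python
import numpy as np


def occluded_region(amodal: np.ndarray, modal: np.ndarray) -> np.ndarray:
    """Pixels that belong to the object but are hidden in this frame."""
    return amodal & ~modal


def mask_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks (1.0 if both are empty)."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter) / float(union) if union > 0 else 1.0


# Toy 6x6 frame (hypothetical data): the object is a 4x4 square.
amodal = np.zeros((6, 6), dtype=bool)
amodal[1:5, 1:5] = True

# An occluder covers the right half of the frame (columns 3 to 5).
occluder = np.zeros((6, 6), dtype=bool)
occluder[:, 3:] = True

# The modal mask is what a conventional tracker would see.
modal = amodal & ~occluder

# The part an amodal method has to infer rather than observe.
hidden = occluded_region(amodal, modal)
occlusion_rate = hidden.sum() / amodal.sum()

print("visible pixels:", int(modal.sum()))
print("hidden pixels:", int(hidden.sum()))
print("occlusion rate:", float(occlusion_rate))
print("IoU(modal, amodal):", mask_iou(modal, amodal))
```

In this toy case half of the object is hidden, so using the modal mask as the amodal prediction scores an IoU of only 0.5. Under full occlusion the modal mask is empty and the IoU drops to 0, which is the regime the abstract highlights.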

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis