TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman

TL;DR
TAPIR is a new model that accurately tracks any queried point on a physical surface throughout a video. It uses a two-stage process of matching and refinement, achieves state-of-the-art results, and enables novel trajectory-based animations.
Contribution
The paper introduces TAPIR, a novel two-stage model that improves both the accuracy and the speed of point tracking in videos, and shows how it can be extended to generate trajectories from static images.
Findings
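Surpasses all baseline methods on the TAP-Vid benchmark, with an approximately 20% absolute average Jaccard (AJ) improvement on DAVIS.
Tracks points faster than real-time on a modern GPU and scales to long, high-resolution videos.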
Facilitates trajectory-based animations from static images.
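The ~20% TAP-Vid gain is measured in average Jaccard (AJ) on DAVIS. Below is a minimal NumPy sketch of that metric, assuming the TAP-Vid definition: for each pixel threshold in {1, 2, 4, 8, 16} (on frames resized to 256×256), Jaccard = TP / (TP + FP + FN), where a true positive is a point visible in both prediction and ground truth and within the threshold, a false positive is predicted visible but occluded or too far, and a false negative is visible in the ground truth but predicted occluded or too far. AJ averages these over thresholds. The function name `average_jaccard` is hypothetical, not the benchmark's API, and the sketch omits TAP-Vid's query-frame masking and per-video averaging.

```python
import numpy as np


def average_jaccard(pred_pts, pred_vis, gt_pts, gt_vis, thresholds=(1, 2, 4, 8, 16)):
    """Average Jaccard over pixel thresholds (hypothetical helper, TAP-Vid-style).

    pred_pts, gt_pts: (N, T, 2) positions in pixels, assumed already at 256x256.
    pred_vis, gt_vis: (N, T) booleans, True where the point is visible.
    """
    dist = np.linalg.norm(pred_pts - gt_pts, axis=-1)
    jaccards = []
    for thr in thresholds:
        within = dist < thr
        # TP: visible in both, and the prediction is close enough.
        tp = np.sum(pred_vis & gt_vis & within)
        # FP: predicted visible, but occluded in the ground truth or too far away.
        fp = np.sum(pred_vis & (~gt_vis | ~within))
        # FN: visible in the ground truth, but predicted occluded or too far away.
        fn = np.sum(gt_vis & (~pred_vis | ~within))
        jaccards.append(tp / (tp + fp + fn))
    return float(np.mean(jaccards))
```

Note that a point predicted visible but too far away counts as both a false positive and a false negative, so AJ penalizes localization and occlusion errors jointly.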
Abstract
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large…
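The two stages described in the abstract can be illustrated with a toy NumPy sketch. It is not TAPIR's implementation: TAPIR's stages are learned networks (with occlusion and uncertainty estimates) on top of learned features. Here the features are simply given. Stage 1 is a plain correlation cost volume followed by a per-frame argmax and a local soft-argmax. Stage 2 replaces the learned temporal refinement with hand-written local re-matching, temporal smoothing, and a query-feature update. All function names and parameters (`match_stage`, `refine_stage`, `local_soft_argmax`, `radius`, `smooth`, `feat_lr`) are hypothetical.

```python
import numpy as np


def local_soft_argmax(score, y, x, radius):
    """Softmax-weighted mean position inside a (2*radius+1)^2 window centred at (y, x)."""
    H, W = score.shape
    y0, y1 = max(0, y - radius), min(H, y + radius + 1)
    x0, x1 = max(0, x - radius), min(W, x + radius + 1)
    patch = score[y0:y1, x0:x1]
    w = np.exp(patch - patch.max())  # subtract the max for numerical stability
    w /= w.sum()
    ys, xs = np.mgrid[y0:y1, x0:x1]
    return np.array([(w * ys).sum(), (w * xs).sum()])


def match_stage(feats, query_feat, radius=2):
    """Stage 1 sketch: match the query independently on every frame.

    feats: (T, H, W, C) feature maps; query_feat: (C,). Returns (T, 2) (y, x) estimates.
    """
    cost = np.einsum("thwc,c->thw", feats, query_feat)  # correlation cost volume
    T, H, W = cost.shape
    track = np.zeros((T, 2))
    for t in range(T):  # no frame looks at any other frame here
        y, x = np.unravel_index(np.argmax(cost[t]), (H, W))
        track[t] = local_soft_argmax(cost[t], y, x, radius)
    return track


def refine_stage(feats, query_feat, track, num_iters=4, radius=2, smooth=0.5, feat_lr=0.5):
    """Stage 2 stand-in: iterate local re-matching, temporal smoothing, query update."""
    T, H, W, _ = feats.shape
    q = query_feat / np.linalg.norm(query_feat)
    for _ in range(num_iters):
        new_track = np.empty_like(track)
        sampled = np.empty((T, q.shape[0]))
        for t in range(T):
            y, x = np.clip(np.rint(track[t]).astype(int), 0, [H - 1, W - 1])
            # Local correlation: the result depends only on the window around (y, x).
            new_track[t] = local_soft_argmax(feats[t] @ q, y, x, radius)
            sampled[t] = feats[t, y, x]
        # Temporal smoothing; odd reflection leaves constant-velocity motion unchanged.
        padded = np.pad(new_track, ((1, 1), (0, 0)), mode="reflect", reflect_type="odd")
        track = (1 - smooth) * new_track + smooth * 0.5 * (padded[:-2] + padded[2:])
        # Query-feature update: blend in the mean feature sampled along the track.
        q = (1 - feat_lr) * q + feat_lr * sampled.mean(axis=0)
        q /= np.linalg.norm(q)
    return track, q
```

Because stage 1 never looks across frames, each initial match is independent of the others; temporal context enters only in stage 2, which mirrors the division of labour the abstract describes.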
Videos
TAPIR: Tracking Any Point with Per-Frame Initialization and Temporal Refinement (YouTube)
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques
Methods: Diffusion · Depthwise Convolution
