TL;DR
Generative Point Tracker (GenPT) introduces a novel flow matching framework to model multi-modal point trajectories, improving accuracy in occluded and ambiguous scenarios by leveraging generative sampling and confidence-guided inference.
Contribution
The paper presents GenPT, a generative framework with flow matching for multi-modal trajectory modeling, outperforming discriminative models especially in occlusion scenarios.
Findings
State-of-the-art accuracy on the PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks.
Effective multi-modality capture in point trajectories.
Enhanced occluded point tracking performance.
Abstract
Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel in regressing long-term point trajectory estimates -- even through occlusions -- they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture multi-modality. To overcome this limitation, we introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. We show how our model's generative capabilities can be leveraged to improve point trajectory…
Peer Reviews
Decision·Submitted to ICLR 2026
- GenPT can model and sample from multiple plausible trajectory candidates, particularly when tracking uncertainty is high due to occlusion. This translates directly to state-of-the-art tracking accuracy on occluded points.
- The model effectively transitions between probabilistic and quasi-deterministic behavior. While always generative, its prediction variance tightly contracts (becoming nearly deterministic) when the tracked point is clearly visible and uniquely identifiable.
- There is a substantial and recurring performance gap between the Oracle scores (the model's maximum potential) and the Greedy scores (the model's actual performance when relying on its confidence). This fundamental disconnect means the model is poor at judging the quality of the trajectories it generates, limiting the real-world utility of its multi-modality.
- The advertised speed advantage (2x faster than CoTracker3) is strictly limited to generating a single sample. To achieve the demonstra
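The Oracle/Greedy gap raised in this review can be illustrated with a minimal sketch. It assumes that "Oracle" means picking the sampled trajectory closest to the ground truth and that "Greedy" means picking the sample with the highest self-reported confidence; the paper's actual selection procedure may differ. The function names and toy data below are hypothetical.

```python
import math

def mean_error(traj, gt):
    """Average Euclidean distance between a trajectory and the ground truth."""
    return sum(math.hypot(px - gx, py - gy)
               for (px, py), (gx, gy) in zip(traj, gt)) / len(gt)

def oracle_pick(samples, gt):
    """Index of the sample with the lowest error (requires ground truth)."""
    return min(range(len(samples)), key=lambda i: mean_error(samples[i], gt))

def greedy_pick(confidences):
    """Index of the sample the model is most confident in (no ground truth)."""
    return max(range(len(confidences)), key=lambda i: confidences[i])

# Toy data (hypothetical): two sampled trajectories for one point over two frames.
gt = [(0.0, 0.0), (1.0, 1.0)]
samples = [
    [(0.0, 0.0), (1.0, 1.0)],   # matches the ground truth
    [(0.0, 0.0), (5.0, 5.0)],   # drifts away
]
confidences = [0.4, 0.9]        # the model prefers the drifting sample

oracle_err = mean_error(samples[oracle_pick(samples, gt)], gt)
greedy_err = mean_error(samples[greedy_pick(confidences)], gt)
gap = greedy_err - oracle_err   # positive when confidence is miscalibrated
```

A large positive `gap` means a good sample was available but the confidence score did not select it, which is the disconnect this review describes.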
1. This paper introduces the first generative point tracker trained using a modified flow-matching objective for trajectories, extending generative modeling concepts to the task of point tracking.
2. The authors design three key modules: iterative refinement, window-dependent prior, and variance schedule. These components are well-motivated and thoroughly ablated.
1. Point tracking is inherently a deterministic problem, so a multi-modal approach may not be well-suited for this task.
2. The improvements of this model mainly target occluded points. However, the objective function used in models such as CoTracker3 or other similar approaches is typically $L = \mathrm{Huber\_loss}(\text{predicted point}, \text{ground truth point}) \times \mathrm{is\_visible\_gt}(\text{this point})$. In other words, these models are not explicitly designed to predict occluded points.
3. The greedy search strategy requires running
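The visibility-masked loss quoted in this review can be sketched as follows. This is an illustration of the formula as the reviewer wrote it, not code from CoTracker3 or GenPT. The `delta` threshold, the use of Euclidean distance as the residual, and averaging over visible points are assumptions made for the example.

```python
import math

def huber(residual, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large residuals."""
    if residual <= delta:
        return 0.5 * residual * residual
    return delta * (residual - 0.5 * delta)

def masked_huber_loss(pred, gt, visible, delta=1.0):
    """Mean Huber loss over points whose ground truth is visible.

    pred, gt: lists of (x, y) coordinates; visible: list of bools.
    Occluded points are multiplied by zero, so they contribute nothing.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # is_visible_gt == 0: no supervision for this point
        total += huber(math.hypot(px - gx, py - gy), delta)
        count += 1
    return total / count if count else 0.0

# Toy data (hypothetical): the second point is occluded in the ground truth.
gt = [(0.0, 0.5), (0.0, 0.0)]
visible = [True, False]
loss_a = masked_huber_loss([(0.0, 0.0), (10.0, 10.0)], gt, visible)
loss_b = masked_huber_loss([(0.0, 0.0), (-3.0, 7.0)], gt, visible)
```

Here `loss_a` and `loss_b` are identical even though the predictions for the occluded point differ, which is the reviewer's point: under such a loss the model receives no training signal for occluded locations.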
- The paper tackles a genuine limitation of current discriminative point trackers, their inability to represent uncertainty and multimodal hypotheses in ambiguous or occluded regions.
- The authors provide comprehensive comparisons across several datasets
### Lack of generative insight
Although the paper positions itself as a generative reformulation of tracking, the actual mechanism remains deterministic iterative optimization under Gaussian perturbation, not a generative process.
- In generative models (diffusion or rectified flow), the model learns to map **pure noise --> data samples**, learning meaningful dynamics along a linear trajectory in data space.
- In GenPT, the model learns **query + noise --> correspondence**, where the starting po
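For reference, the "pure noise to data along a linear trajectory" mapping that this review contrasts against can be sketched in its generic textbook form. This is not GenPT's formulation. The function names are hypothetical, and the velocity function passed to the sampler is an oracle stand-in for what would normally be a learned network.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target in linear flow matching: constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy example (hypothetical values): one noise sample and one data sample.
noise = [0.0, 0.0]
data = [2.0, -1.0]

# Oracle velocity field standing in for a trained model.
sample = euler_sample(noise, lambda x, t: target_velocity(noise, data))
midpoint = interpolate(noise, data, 0.5)
```

With the oracle velocity, `sample` ends up at `data` up to floating-point error. In training, a network would be fit to predict `target_velocity` at points produced by `interpolate`.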
