Online Long-term Point Tracking in the Foundation Model Era
Görkay Aydemir

TL;DR
This paper introduces Track-On, a transformer-based model for online long-term point tracking in videos, leveraging foundation models and memory to operate causally without future frames, achieving state-of-the-art results.
Contribution
We propose Track-On, a novel transformer architecture that enables online long-term point tracking by maintaining temporal coherence without future frame access.
Findings
Set a new state of the art across seven benchmarks.
Demonstrated the effectiveness of foundation models in online tracking.
Validated the importance of memory for causal long-term tracking.
Abstract
Point tracking aims to identify the same physical point across video frames and serves as a geometry-aware representation of motion. This representation supports a wide range of applications, from robotics to augmented reality, by enabling accurate modeling of dynamic environments. Most existing long-term tracking approaches operate in an offline setting, where future frames are available to refine predictions and recover from occlusions. However, real-world scenarios often demand online predictions: the model must operate causally, using only current and past frames. This constraint is critical in streaming video and embodied AI, where decisions must be made immediately based on past observations. Under such constraints, viewpoint invariance becomes essential. Visual foundation models, trained on diverse large-scale datasets, offer the potential for robust geometric representations.…
Taxonomy
Topics: Geological Modeling and Analysis
