Track-On: Transformer-based Online Point Tracking with Memory
Görkay Aydemir, Xiongyi Cai, Weidi Xie, Fatma Güney

TL;DR
Track-On is a transformer-based online point tracking model that uses memory modules to achieve accurate, real-time long-term tracking in videos without future frame access, outperforming previous methods.
Contribution
The paper introduces a novel online transformer model with memory modules for long-term point tracking, enabling real-time performance without future frame information.
Findings
Sets new state-of-the-art for online point tracking.
Outperforms offline methods on multiple datasets.
Robust in diverse real-world scenarios.
Abstract
In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules -- spatial memory and context memory -- to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive…
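The causal, memory-based loop and the patch classification plus refinement step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`classify_patch`, `refine`, `track_online`), the cosine-similarity matching, the averaging of a fixed-size memory into the query, and the 3x3 soft-argmax refinement are all assumptions chosen for illustration, standing in for the learned transformer components.

```python
from collections import deque

import numpy as np


def classify_patch(query, patch_feats):
    """Pick the patch whose feature is most similar to the query.

    query: (C,) feature vector; patch_feats: (H, W, C) per-patch features.
    Returns the (row, col) of the best patch and the (H, W) similarity map.
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    norms = np.linalg.norm(patch_feats, axis=-1, keepdims=True) + 1e-8
    sim = (patch_feats / norms) @ q  # cosine similarity per patch
    row, col = np.unravel_index(np.argmax(sim), sim.shape)
    return int(row), int(col), sim


def refine(sim, row, col, patch_size, temperature=0.1):
    """Hypothetical refinement: soft-argmax over the 3x3 patch neighbourhood.

    Returns sub-patch (x, y) pixel coordinates around the classified patch.
    """
    height, width = sim.shape
    rows = np.arange(max(row - 1, 0), min(row + 2, height))
    cols = np.arange(max(col - 1, 0), min(col + 2, width))
    local = sim[np.ix_(rows, cols)] / temperature
    weights = np.exp(local - local.max())
    weights /= weights.sum()
    r = (weights.sum(axis=1) * rows).sum()
    c = (weights.sum(axis=0) * cols).sum()
    return (c + 0.5) * patch_size, (r + 0.5) * patch_size


def track_online(query, frames_feats, patch_size=4, memory_size=3):
    """Track one point frame by frame, never looking at future frames."""
    memory = deque(maxlen=memory_size)  # bounded memory of matched features
    track = []
    for feats in frames_feats:
        # Assumed memory read: average the initial query with stored features.
        q = np.mean([query, *memory], axis=0) if memory else query
        row, col, sim = classify_patch(q, feats)
        track.append(refine(sim, row, col, patch_size))
        memory.append(feats[row, col])  # assumed memory write
    return track
```

Classifying a patch first and refining afterwards keeps the coarse decision discrete, while the bounded `deque` keeps the per-frame cost constant regardless of video length. The real model's memory read and write operations are learned, so the averaging here should be read only as a placeholder.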
Peer Reviews
Decision: ICLR 2025 Poster
The dual memory module design effectively handles long video sequences in an online setting while addressing feature drift. Track-On demonstrates fast inference speeds while maintaining high accuracy, a significant advantage in real-time tracking systems. The model establishes a new benchmark for online models on the TAP-Vid benchmark and shows competitive performance against offline methods.
The paper mentions potential precision loss when tracking thin surfaces or instances with similar appearances, indicating limitations in handling certain visual effects. The model may struggle to establish correct correspondences in complex scenes with high visual similarity, affecting tracking accuracy. While the model is memory-efficient, there is a trade-off between memory module size and inference speed, which may require balancing in practical applications.
- The paper is well-written and easy to follow, with extensive experiments that clearly demonstrate the performance improvements contributed by each module.
- The authors propose an effective approach using patch classification, rather than regression, to predict each point location; while patch classification was mentioned in the PIPs paper, it was previously used only to accelerate convergence by supervising score maps.
- The authors propose 2 memory modules that effectively improve the track
- Line 289: I think you meant $\gamma^s$, not $\gamma^q$, which does not appear in Eq. 8.
- Table 1: the authors should show the backbone used in each approach for easier comparison. If possible, the methods should be compared on the same backbone. However, this is alleviated by Table 3, where the authors have shown the effectiveness of the memory modules.
1. The authors design two memory modules to mine spatio-temporal information and use a coarse-to-fine scheme for more accurate point prediction.
2. Experiments demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance among online point tracking methods.
1. The experiments are inadequate. The design analysis and the layer-number ablations for each module are insufficient, for example the number of decoders and the number (4) of similarity-map levels used for patch classification.
2. The figures in the paper are not clear enough. For example, on the left of Figure 5, the meaning of the input qinit is unclear, because q has no subscript and its meaning is not specified. In addition, the title of Figure 5 should go from left to right when in
Taxonomy
Topics: Inertial Sensor and Navigation · Advanced Vision and Imaging · Iterative Learning Control Systems
