Track-On: Transformer-based Online Point Tracking with Memory

Görkay Aydemir; Xiongyi Cai; Weidi Xie; Fatma Güney

arXiv:2501.18487·cs.CV·January 31, 2025

TL;DR

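Track-On is a transformer-based online point tracking model that processes video frame by frame, without access to future frames. It uses two memory modules, spatial memory and context memory, to track points accurately and in real time over long horizons, and it sets a new state of the art among online point trackers.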

Contribution

The paper introduces a novel online transformer model with memory modules for long-term point tracking, enabling real-time performance without future frame information.

Findings

01. Sets new state-of-the-art for online point tracking.

02. Outperforms offline methods on multiple datasets.

03. Robust in diverse real-world scenarios.

Abstract

In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules -- spatial memory and context memory -- to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive…
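The patch classification and refinement step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `classify_patch` and `refine`, the cosine-similarity scoring, the stride value, and the hand-supplied offset are all assumptions made for illustration. In the paper the refinement is part of the model; here it is a fixed vector so the example stays self-contained.

```python
import numpy as np

def classify_patch(query, feat_map, stride):
    """Treat localization as classification over patches (illustrative only).

    query:    (C,) feature of the tracked point (hypothetical input).
    feat_map: (H, W, C) patch features of the current frame.
    stride:   pixel size of one patch, assumed square.
    Returns the (x, y) pixel center of the best patch and the patch probabilities.
    """
    h, w, c = feat_map.shape
    flat = feat_map.reshape(h * w, c)
    # Cosine similarity between the query and every patch feature.
    q = query / (np.linalg.norm(query) + 1e-8)
    f = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    logits = f @ q
    # Softmax over all patches gives a classification distribution.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    row, col = divmod(int(np.argmax(probs)), w)
    center = np.array([(col + 0.5) * stride, (row + 0.5) * stride])
    return center, probs

def refine(center, offset, stride):
    """Add a sub-patch offset to the coarse patch center.

    The offset is supplied by hand here; it is clipped to one patch
    width so the refined point stays near the selected patch.
    """
    return center + np.clip(offset, -stride, stride)

# Toy frame: a 4x6 grid of 8-dimensional patch features.
rng = np.random.default_rng(0)
feat_map = rng.normal(size=(4, 6, 8))
query = feat_map[2, 3].copy()  # the query matches the patch at row 2, col 3

center, probs = classify_patch(query, feat_map, stride=4)
point = refine(center, np.array([1.0, -1.5]), stride=4)
print(center, point)
```

Selecting a patch first and then adjusting within it splits the problem into a coarse discrete choice and a small continuous correction, which is the general idea behind classify-then-refine localization.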

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The dual memory module design effectively handles long video sequences in an online setting while addressing feature drift. Track-On demonstrates fast inference speeds while maintaining high accuracy, a significant advantage in real-time tracking systems. The model establishes a new benchmark for online models on the TAP-Vid benchmark and shows competitive performance against offline methods.

Weaknesses

The paper mentions potential precision loss when tracking thin surfaces or instances with similar appearances, indicating limitations in handling certain visual effects. The model may struggle to establish correct correspondences in complex scenes with high visual similarity, affecting tracking accuracy. While the model is memory-efficient, there is a trade-off between memory module size and inference speed, which may require balancing in practical applications.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The paper is well-written and easy to follow, with extensive experiments that clearly demonstrate the performance improvements contributed by each module.
- The authors propose an effective approach using patch classification, rather than regression, to predict each point location; while patch classification was mentioned in the PIPs paper, it was previously used only to accelerate convergence by supervising score maps.
- The authors propose 2 memory modules that effectively improve the track

Weaknesses

- Line 289: I think you meant $\gamma^s$, not $\gamma^q$, which does not appear in Eq. 8.
- Table 1: the authors should show the backbone used in each approach for easier comparison. If possible, the methods should be compared on the same backbone. However, this is alleviated by Table 3, where the authors have shown the effectiveness of the memory modules.

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. The authors designed two memory modules to mine spatio-temporal information and use a coarse-to-fine approach for more accurate point prediction.
2. Experiments demonstrate the effectiveness of the proposed method, which achieves SOTA performance among online point tracking methods.

Weaknesses

1. The experiments are inadequate. The design analysis and the layer-number ablations for each module are insufficient, for example for the several decoders and for the number (4) of similarity-map levels used for patch classification.
2. The figures in the paper are not clear enough. For example, on the left of Figure 5, the meaning of the input qinit is unclear, because q has no subscript and is not defined. In addition, the title of Figure 5 should go from left to right when in

Code & Models

Repositories

gorkaydemir/track_on
PyTorch · Official

Taxonomy

Topics: Inertial Sensor and Navigation · Advanced Vision and Imaging · Iterative Learning Control Systems