TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

Jinyuan Qu; Hongyang Li; Shilong Liu; Tianhe Ren; Zhaoyang Zeng; Lei Zhang

arXiv:2411.18671·cs.CV·September 29, 2025

Open Access 3 Reviews

TL;DR

TAPTRv3 enhances long video point tracking by integrating spatial and temporal context through novel attention mechanisms, significantly improving robustness and accuracy over previous methods.

Contribution

The paper introduces Context-aware Cross-Attention and Visibility-aware Long-Temporal Attention to improve feature querying in long videos for robust point tracking.
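As a rough illustration of the idea behind Visibility-aware Long-Temporal Attention, the sketch below attends over features from past frames and down-weights frames in which the point was probably occluded. It is an assumption-laden toy, not the paper's formulation: the function name `visibility_weighted_attention`, the log-visibility bias, and the `eps` constant are all hypothetical choices made here for illustration, and the exact way TAPTRv3 combines visibility with attention scores is not stated on this page.

```python
import math


def visibility_weighted_attention(query, keys, values, visibility, eps=1e-6):
    """Toy attention over past frames, biased by predicted visibility.

    query:      list[float], feature of the point in the current frame
    keys:       list[list[float]], one key per past frame
    values:     list[list[float]], one value per past frame
    visibility: list[float] in [0, 1], predicted visibility per past frame

    Assumption (not from the paper): visibility enters as an additive
    log-bias on the logits, so each weight is proportional to
    visibility * exp(scaled dot-product score).
    """
    d = len(query)
    logits = []
    for key, vis in zip(keys, visibility):
        # Standard scaled dot-product score between query and this frame's key.
        score = sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        # Low visibility pushes the logit down; eps avoids log(0).
        logits.append(score + math.log(vis + eps))

    # Numerically stable softmax over past frames.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the per-frame values.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Two past frames with identical keys; the second frame is predicted occluded.
out, w = visibility_weighted_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    visibility=[1.0, 0.0],
)
```

With identical keys, the only thing separating the two frames is the visibility term, so nearly all of the attention mass falls on the visible frame. Using a log-bias rather than multiplying after the softmax keeps the weights normalized without a second renormalization step; either design would be a plausible reading of "visibility-aware" attention.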

Findings

1. Surpasses TAPTRv2 on multiple challenging datasets
2. Achieves state-of-the-art performance in long video point tracking
3. Outperforms methods trained on large-scale internal data

Abstract

In this paper, built upon TAPTRv2, we present TAPTRv3. TAPTRv2 is a simple yet effective DETR-like point tracking framework that works fine in regular videos but tends to fail in long videos. TAPTRv3 improves TAPTRv2 by addressing its shortcomings in querying high-quality features from long videos, where the target tracking points normally undergo increasing variation over time. In TAPTRv3, we propose to utilize both spatial and temporal context to bring better feature querying along the spatial and temporal dimensions for more robust tracking in long videos. For better spatial feature querying, we identify that off-the-shelf attention mechanisms struggle with point-level tasks and present Context-aware Cross-Attention (CCA). CCA introduces spatial context into the attention mechanism to enhance the quality of attention scores when querying image features. For better temporal feature…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The introduction of Context-aware Cross-Attention (CCA) and Visibility-aware Long-Temporal Attention (VLTA) seems reasonable.
2. The auto-triggered global matching mechanism is easy to follow.
3. This paper provides extensive quantitative results on public tracking benchmarks, demonstrating the method's robustness and efficiency.

Weaknesses

1. The overall framework of TAPTRv3 is highly similar to TAPTRv2, which also introduces Attention-based Position Update and a visibility classifier to maintain temporal consistency. Despite the implementation variations, the key idea of this work is not very novel.
2. The performance gains over CoTracker3 and Track-On in Table 1 are marginal.
3. The auto-triggered global matching is simply a global redetection mechanism, which has been widely explored in other tracking framewor

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- This paper focuses on improvements such as enhancing context-aware capabilities and modifying attention weights using visibility, which are highly intuitive and also suitable for video tasks.
- This paper achieves good performance on multiple datasets.
- The ablation experiments for the newly proposed modules in this paper are conducted in detail.

Weaknesses

- The key improvement of this paper lies in proposing two types of attention mechanisms to replace previous RNN-like methods. However, there are now many improved RNN-like neural networks, such as the readily applicable Mamba and RWKV. The paper does not sufficiently discuss these methods: could improved recurrent structures also alleviate the challenges in long-context modeling?
- In visibility-aware attention, the authors modify attention weights using predicted visibility. While this operation

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The formulation for the proposed method is simple and reasonable, and contributions claimed by the authors are straightforward and clear.
- The motivation of the proposed method is clear, where failure cases of TAPTRv2 including scenarios with long-term temporal drift and scene cuts are well-addressed and quantified.
- The authors performed ablation experiments that validate the effectiveness of each proposed component. The results in Table 2 show that each proposed module (VLTA, CCA, etc.)

Weaknesses

- The main experimental comparisons are performed on subsets of the TAP-Vid benchmark. Although these subsets are more oriented toward long-term tracking than previous benchmarks such as DAVIS, they still seem short in temporal length. Datasets such as PointOdyssey [a] contain longer sequences and are better suited to evaluating the proposed method.
- [a] Yang et al., PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking. ICCV 2023.
- Although the proposed framework s

Taxonomy

Topics: Advanced Image Processing Techniques · Image and Video Quality Assessment · Image and Signal Denoising Methods

Methods: Softmax · Attention Is All You Need