DELTA: Dense Efficient Long-range 3D Tracking for any video

Tuan Duc Ngo; Peiye Zhuang; Chuang Gan; Evangelos Kalogerakis; Sergey Tulyakov; Hsin-Ying Lee; Chaoyang Wang

arXiv:2410.24211·cs.CV·March 3, 2025

Open Access · 3 Reviews

TL;DR

DELTA is a fast, dense 3D tracking method for videos that achieves pixel-level accuracy over long sequences using a novel attention-based approach, outperforming existing techniques in speed and precision.

Contribution

We introduce DELTA, a novel dense 3D tracking framework that combines global-local attention and transformer upsampling for efficient, high-resolution motion estimation in videos.

Findings

01

DELTA runs over 8x faster than previous methods.

02

Achieves state-of-the-art accuracy in 2D and 3D dense tracking.

03

Log-depth representation enhances tracking performance.

Abstract

Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. We introduce DELTA, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or sparse tracking, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy. Furthermore, we explore the impact of depth representation on tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks,…
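As a rough illustration of the log-depth idea mentioned in the abstract, the sketch below shows one well-known property of working in log space: the same relative change in depth yields the same change in the representation, whether the point is near or far. This is a minimal sketch, not the authors' implementation; the function names `to_log_depth` and `from_log_depth` and the sample depth values are assumptions made for illustration only.

```python
import math


def to_log_depth(depth: float) -> float:
    """Map a positive depth value to log-depth (hypothetical helper)."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return math.log(depth)


def from_log_depth(log_depth: float) -> float:
    """Invert to_log_depth (hypothetical helper)."""
    return math.exp(log_depth)


# The same 10% relative depth change at near range and at far range.
# Depth values are made up for illustration.
near_delta_log = to_log_depth(1.1) - to_log_depth(1.0)
far_delta_log = to_log_depth(55.0) - to_log_depth(50.0)

# In linear depth the far change is 50x larger in magnitude.
near_delta_linear = 1.1 - 1.0
far_delta_linear = 55.0 - 50.0

print(f"log-depth change (near, far): {near_delta_log:.4f}, {far_delta_log:.4f}")
print(f"linear change    (near, far): {near_delta_linear:.4f}, {far_delta_linear:.4f}")
```

In linear depth, errors on distant points dominate simply because the values are larger; in log-depth, the two changes above are equal. Whether this is the reason log-depth performs best in DELTA is not stated in the excerpt shown here.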

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The proposed method indeed remedies the efficiency issue of previous SOTA methods significantly.
2. The experiments are comprehensive and convincing.

Weaknesses

Some key concepts and design choices are not well explained.
1. Why do we need certain anchor tracks? What if we sample certain points in the dense tracks instead?
2. The difference between sparse tracks and dense tracks is not well explained. For the comparisons in subfigures 2 and 3 of Fig. 3, the dimensions of sparse tracks and dense tracks are both T x N, so why are they called dense tracks? What are N' and L? The meaning of these two parameters is not well defined or explained.
3. If the depth i

Reviewer 02 · Rating 6 · Confidence 5

Strengths

This is one of the first papers addressing the TAP-3D problem with a feedforward model (not requiring test-time optimization), along with SceneTracker and SpatialTracker, which makes it a valuable contribution in this newly developing field. In addition, the authors present quantitative results on both 2D tracking and 3D tracking which demonstrate the good performance of the model. The proposed model obtains SOTA results on the CVO benchmark as well as in the TAPVid-3D benchmark. The visualiza

Weaknesses

While the paper is a valuable contribution with good results, the model design seems to be mainly a combination of ideas from SceneTracker and CoTracker. The paper does not clearly state which ideas are novel, and which are borrowed from these previous methods. Furthermore, the experimental section only presents 2D tracking results on CVO, which is not the most widespread 2D tracking benchmark. Results on TAPVid-DAVIS would help in demonstrating the SOTA status of this model for 2D tracking.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. TAPE3D introduces the spatial-temporal transformer to extract more visual features and uses the Upsampler module to obtain high-resolution results.
2. TAPE3D delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy.

Weaknesses

1. The idea of using the spatial-temporal transformer is not new; see, for example, [1][2][3].
2. The authors are encouraged to provide the compilation cost of each module to verify the efficiency of TAPE3D.
3. Were all inference experiments run on the same machine? Details of the machine, including GPU and CPU, should be provided.

[1] Hu M, Zhu X, Wang H, et al. Stdformer: Spatial-temporal motion transformer for multiple object tracking[J]. IEEE Transactions on Circuits and Systems for Video

Taxonomy

Topics: Advanced Vision and Imaging · Image Enhancement Techniques · Video Surveillance and Tracking Methods

Methods: Softmax · Attention Is All You Need · Global-Local Attention