Long-Term 3D Point Tracking By Cost Volume Fusion
Hung Nguyen, Chanho Kim, Rigved Naukarkar, Li Fuxin

TL;DR
This paper introduces a novel deep learning framework for long-term 3D point tracking that effectively fuses past information using a transformer-based cost volume, outperforming previous 2D and scene flow methods.
Contribution
It presents the first 3D deep learning approach for long-term point tracking that generalizes without test-time fine-tuning, utilizing a cost volume fusion module with transformer architecture.
Findings
Outperforms simple scene flow chaining in 3D tracking.
Surpasses previous 2D point tracking methods in accuracy.
Works effectively without test-time fine-tuning.
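For readers unfamiliar with the "simple scene flow chaining" baseline named above, the idea can be sketched as follows. This is a minimal illustration, not the paper's implementation. The function name `chain_scene_flow`, the nearest-neighbor lookup, and the array layout are all assumptions made for this example.

```python
import numpy as np

def chain_scene_flow(query, clouds, flows):
    """Track one 3D point by accumulating frame-to-frame scene flow.

    query:  (3,) starting position in frame 0
    clouds: list of (N_t, 3) point clouds for frames 0..T-1
    flows:  list of (N_t, 3) flow vectors from frame t to t+1 (frames 0..T-2)

    Assumption: the flow applied to the tracked point is borrowed from its
    nearest neighbor in the current cloud (hypothetical lookup rule).
    """
    track = [np.asarray(query, dtype=float)]
    # zip stops at the shorter list, so the last cloud needs no flow.
    for cloud, flow in zip(clouds, flows):
        pos = track[-1]
        nearest = np.argmin(np.linalg.norm(cloud - pos, axis=1))
        track.append(pos + flow[nearest])
    return np.stack(track)  # (len(flows) + 1, 3)
```

Each step in this sketch uses only the current frame's flow, so it keeps no memory of earlier appearances of the point and any per-frame error is carried into all later positions.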
Abstract
Long-term point tracking is essential to understand non-rigid motion in the physical world better. Deep learning approaches have recently been incorporated into long-term point tracking, but most prior work predominantly functions in 2D. Although these methods benefit from the well-established backbones and matching frameworks, the motions they produce do not always make sense in the 3D physical world. In this paper, we propose the first deep learning framework for long-term point tracking in 3D that generalizes to new points and videos without requiring test-time fine-tuning. Our model contains a cost volume fusion module that effectively integrates multiple past appearances and motion information via a transformer architecture, significantly enhancing overall tracking performance. In terms of 3D tracking performance, our model significantly outperforms simple scene flow chaining and…
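To make the notion of fusing cost volumes from multiple past appearances more concrete, here is a rough sketch. It is not the paper's module: the abstract describes a transformer for fusion, whereas this example substitutes a plain average. The names `cost_volume` and `fuse_costs`, the cosine-similarity cost, and the `temperature` parameter are hypothetical choices made for illustration.

```python
import numpy as np

def cost_volume(query_feat, cand_feats):
    """Cosine similarity between one query feature (C,) and K candidates (K, C)."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    c = cand_feats / (np.linalg.norm(cand_feats, axis=1, keepdims=True) + 1e-8)
    return c @ q  # shape (K,)

def fuse_costs(past_feats, cand_feats, cand_xyz, temperature=0.1):
    """Fuse cost volumes from several past appearances of the tracked point.

    past_feats: (T, C) features of the point as seen in T earlier frames
    cand_feats: (K, C) features of candidate points in the current frame
    cand_xyz:   (K, 3) positions of those candidates

    Assumption: a plain mean over time stands in for the learned
    transformer-based fusion; the paper's fusion is not reproduced here.
    """
    costs = np.stack([cost_volume(f, cand_feats) for f in past_feats])  # (T, K)
    fused = costs.mean(axis=0)                                         # (K,)
    # Soft-argmax: convert fused costs into weights over candidate positions.
    weights = np.exp((fused - fused.max()) / temperature)
    weights /= weights.sum()
    return weights @ cand_xyz, fused
```

A learned fusion could weight past frames unequally, for instance to discount frames in which the point was occluded; the mean used here treats every past appearance the same.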
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper introduces a deep learning-based framework for long-term 3D point tracking, which the authors claim is the first in its domain. The way the approach combines various techniques to address a complex problem is impressive. 2. The authors provide extensive experimental results on various benchmarks to verify their claims.
1. The authors claim this is the first method for 3D point tracking. However, the highly related work SpatialTracker [A] is neither discussed nor compared against. Although SpatialTracker works on 2D images and depths instead of directly on 3D points, its input and output are exactly the same as in this work. The authors should discuss and compare with SpatialTracker to verify their claims. This also relates to another question: "Is it really necessary to lift depths to 3D points for 3D tracking?" 2
Utilizes a transformer-based cost volume fusion module to handle occlusions and integrate long-term appearance and motion information. Extensive performance evaluation demonstrates superiority over 2D methods, especially in occluded scenarios.
The paper claims to be the first method to achieve 3D point tracking without test-time optimization, but to the best of the reviewer's knowledge, some works such as FlowNet3D can also predict point cloud tracking results without test-time optimization. The difference between the proposed method and these methods is not clear. Since the paper emphasizes that it does not need test-time optimization, a brief comparison of runtime performance would strengthen the discussion of practical applicability.
- I like the paper's idea and the problem it tackles. The problem setting is quite close to scene flow; although there have been many scene flow papers recently, it remains an unsolved problem. The paper's main novelty lies in its network architecture. Although its individual components aren't novel (the overall architecture resembles recent point tracking papers like PIPs, Harley et al.), applying it to long-term 3D tracking is a nice contribution. - The paper is well-written and well-structured.
* Missing evaluations on real data: Although the paper has extensive evaluations against recent scene flow methods, it lacks any evaluation or even qualitative results on real data (I checked the supplementary material as well). While I understand that real data doesn't always provide good ground truth, especially for dynamic objects, making evaluation challenging, the absence of results on common benchmarks like the KITTI Scene Flow dataset is unfortunate. The complete lack of real-data results
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Measurement and Metrology Techniques · Image and Object Detection Techniques
