TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels
Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang, Zhiyang Dou, Cheng Lin, Sai-Kit Yeung, Yuan Liu

TL;DR
TrackingWorld is a dense 3D tracking pipeline that separates camera motion from dynamic object motion, enabling world-centric monocular 3D tracking of nearly all pixels in a video.
Contribution
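The paper presents a dense 3D tracking method that lifts sparse 2D tracks to dense 2D tracks and estimates world-centric 3D trajectories. It targets two limitations of earlier monocular 3D trackers: they do not separate camera motion from foreground dynamic motion, and they cannot densely track subjects that newly emerge in a video.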
Findings
Achieves accurate dense 3D tracking on synthetic and real datasets.
Effectively separates camera motion from dynamic object motion.
Handles newly emerging objects in videos.
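To make the idea of separating camera motion from object motion concrete, here is a minimal NumPy sketch. It is not the paper's method: it assumes the per-frame camera-to-world rotations and translations are already known, and the function name `camera_to_world` is hypothetical. It only illustrates that a static point appears to move in camera coordinates but stays fixed in world coordinates, so any remaining world-space motion can be attributed to the object.

```python
import numpy as np

def camera_to_world(points_cam, rotations, translations):
    """Map one 3D track from camera coordinates to world coordinates.

    points_cam:   (T, 3) position of the tracked point in each frame's camera frame
    rotations:    (T, 3, 3) camera-to-world rotations (assumed known)
    translations: (T, 3) camera-to-world translations (assumed known)
    """
    # Per frame: X_world = R_t @ X_cam + t_t
    return np.einsum("tij,tj->ti", rotations, points_cam) + translations

# Toy setup (illustrative values): a static world point, camera sliding along x.
num_frames = 4
world_point = np.array([0.0, 0.0, 5.0])
rotations = np.tile(np.eye(3), (num_frames, 1, 1))
translations = np.array([[0.1 * t, 0.0, 0.0] for t in range(num_frames)])

# With identity rotation, the camera-frame position is X_world - t_t.
points_cam = world_point[None, :] - translations

world_track = camera_to_world(points_cam, rotations, translations)

# Apparent motion in camera space versus motion in world space.
camera_space_motion = np.linalg.norm(points_cam - points_cam[0], axis=1)
world_space_motion = np.linalg.norm(world_track - world_track[0], axis=1)
```

In this toy case `camera_space_motion` grows with the camera translation while `world_space_motion` stays at zero, which is the behavior a world-centric tracker needs for static background pixels.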
Abstract
Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that existing monocular 3D tracking methods still fall short in separating camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating tracks in overlapping regions. Finally, we present an efficient…
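The redundancy-reduction step described in the abstract can be illustrated with a short NumPy sketch. This is one plausible reading under stated assumptions, not the authors' implementation: dense tracks are assumed to be started at every frame, a track is dropped when its starting pixel is already visited by a track kept from an earlier frame, and visibility is ignored. The function name `prune_overlapping_tracks` and the data layout are hypothetical.

```python
import numpy as np

def prune_overlapping_tracks(tracks_per_frame, height, width):
    """Return one boolean keep-mask per start frame.

    tracks_per_frame[s] has shape (N_s, T - s, 2): (x, y) positions of the
    N_s tracks started at frame s, for frames s .. T-1.
    """
    num_frames = len(tracks_per_frame)
    # covered[t, y, x] is True once a kept track passes through pixel (x, y) at frame t.
    covered = np.zeros((num_frames, height, width), dtype=bool)
    keep_masks = []
    for s, tracks in enumerate(tracks_per_frame):
        # Round start positions to pixels and clamp them inside the image.
        start = np.rint(tracks[:, 0]).astype(int)
        x0 = np.clip(start[:, 0], 0, width - 1)
        y0 = np.clip(start[:, 1], 0, height - 1)
        # Keep only tracks whose start pixel is not already covered.
        keep = ~covered[s, y0, x0]
        keep_masks.append(keep)
        # Mark every in-bounds pixel visited by the kept tracks as covered.
        kept = np.rint(tracks[keep]).astype(int)
        for dt in range(kept.shape[1]):
            x, y = kept[:, dt, 0], kept[:, dt, 1]
            inside = (x >= 0) & (x < width) & (y >= 0) & (y < height)
            covered[s + dt, y[inside], x[inside]] = True
    return keep_masks

# Toy example: a 1x3 image over 2 frames.
# Two tracks start at frame 0 (x = 0 and x = 1) and move one pixel to the right.
tracks_frame0 = np.array([
    [[0.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [2.0, 0.0]],
])
# Three tracks start at frame 1, one per pixel.
tracks_frame1 = np.array([
    [[0.0, 0.0]],
    [[1.0, 0.0]],
    [[2.0, 0.0]],
])
keep_masks = prune_overlapping_tracks([tracks_frame0, tracks_frame1], height=1, width=3)
```

In the toy example, both frame-0 tracks are kept, and at frame 1 only the track starting at x = 0 survives, because pixels x = 1 and x = 2 are already covered by tracks from frame 0. A real pipeline would likely also account for occlusion and sub-pixel overlap, which this sketch leaves out.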
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Advanced Vision and Imaging
