Sparse3DTrack: Monocular 3D Object Tracking Using Sparse Supervision
Nikhil Gosala, B. Ravi Kiran, Senthil Yogamani, Abhinav Valada

TL;DR
Sparse3DTrack introduces a novel framework for monocular 3D object tracking that leverages sparse supervision and pseudo-labeling to achieve high performance with minimal annotations, reducing labeling costs.
Contribution
Sparse3DTrack is the first sparsely supervised approach to monocular 3D tracking. It decomposes the task into 2D query matching and 3D geometry estimation, and it generates dense pseudo-labels from sparse annotations.
Findings
Achieves an improvement of up to 15.50 percentage points on the KITTI and nuScenes datasets.
Operates effectively with at most four ground truth annotations per track.
Significantly reduces the need for dense 3D annotations in training.
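To make the "at most four ground truth annotations per track" setting concrete, the following minimal sketch shows one way a sparse label budget could be simulated from a densely annotated dataset. It is an illustration only: the function names `sparsify_track` and `sparsify_dataset`, the track dictionary layout, and the evenly spaced selection rule are all assumptions and are not taken from the paper, which may choose its labeled frames differently.

```python
from typing import Dict, List


def sparsify_track(frame_ids: List[int], max_labels: int = 4) -> List[int]:
    """Keep at most `max_labels` roughly evenly spaced frames of one track.

    Hypothetical helper: even spacing is an assumption made for illustration.
    """
    if max_labels < 1:
        raise ValueError("max_labels must be at least 1")
    frames = sorted(frame_ids)
    if len(frames) <= max_labels:
        return frames  # track is already within the label budget
    if max_labels == 1:
        return [frames[len(frames) // 2]]  # keep the middle frame only
    # Spread the budget from the first to the last frame of the track.
    step = (len(frames) - 1) / (max_labels - 1)
    indices = sorted({round(i * step) for i in range(max_labels)})
    return [frames[i] for i in indices]


def sparsify_dataset(
    tracks: Dict[str, List[int]], max_labels: int = 4
) -> Dict[str, List[int]]:
    """Apply the per-track budget to every track (track id -> labeled frames)."""
    return {tid: sparsify_track(f, max_labels) for tid, f in tracks.items()}


if __name__ == "__main__":
    dense = {"car_0": list(range(10)), "ped_3": [12, 14]}
    print(sparsify_dataset(dense))
    # -> {'car_0': [0, 3, 6, 9], 'ped_3': [12, 14]}
```

In such a setup, the frames that are not selected would carry no ground truth and would have to be covered by pseudo-labels during training.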
Abstract
Monocular 3D object tracking aims to estimate temporally consistent 3D object poses across video frames, enabling autonomous agents to reason about scene dynamics. However, existing state-of-the-art approaches are fully supervised and rely on dense 3D annotations over long video sequences, which are expensive to obtain and difficult to scale. In this work, we address this fundamental limitation by proposing the first sparsely supervised framework for monocular 3D object tracking. Our approach decomposes the task into two sequential sub-problems: 2D query matching and 3D geometry estimation. Both components leverage the spatio-temporal consistency of image sequences to augment a sparse set of labeled samples and learn rich 2D and 3D representations of the scene. Leveraging these learned cues, our model automatically generates high-quality 3D pseudolabels across entire videos, effectively…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Face recognition and analysis
