M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking
Qiangqiang Wu, Tianyu Yang, Bo Fang, Jia Wan, Matias Di Martino, Guillermo Sapiro, Antoni B. Chan

TL;DR
This paper introduces Mask-to-Point (M2P), a weakly-supervised learning method that improves visual foundation models for dense point tracking in videos by training on video object segmentation masks with three mask-based constraints.
Contribution
M2P is the first to incorporate mask-based constraints into foundation models for dense point tracking, significantly improving performance with minimal training data.
Findings
M2P outperforms baseline VFMs on the TAP-Vid-DAVIS benchmark.
Achieves performance gains of 12.8% and 14.6% over DINOv2-B/14 and DINOv3-B/16, respectively.
Uses only 3.6K VOS videos for training.
Abstract
Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point-to-point matching learning. Second, we propose a mask label consistency (MLC) loss, which…
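To make the Procrustes idea behind the local structure consistency loss more concrete, here is a minimal sketch of rigid Procrustes alignment for 2D points. It is not the paper's loss: the function names `fit_rigid_2d` and `procrustes_residual`, the restriction to rotation plus translation (no scaling), and the use of the mean squared residual as a consistency score are all assumptions made for illustration.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    src and dst are equal-length lists of (x, y) tuples in corresponding order.
    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(src)
    # Centroids of both point sets
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n

    # Accumulate dot and cross products of the centered points
    dot = 0.0
    cross = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay = ax - sx, ay - sy
        bx, by = bx - dx, by - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    # Closed-form optimal rotation angle in 2D
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)


def procrustes_residual(src, dst):
    """Mean squared distance between dst and the rigidly aligned src."""
    theta, (tx, ty) = fit_rigid_2d(src, dst)
    c, s = math.cos(theta), math.sin(theta)
    total = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        px = c * ax - s * ay + tx
        py = s * ax + c * ay + ty
        total += (px - bx) ** 2 + (py - by) ** 2
    return total / len(src)
```

Under these assumptions, a small residual means the matched points in a local neighborhood moved together as one rigid patch, while a large residual flags matches that break the local structure. A training loss built on this idea would penalize the residual for predicted correspondences.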
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Advanced Vision and Imaging
