TL;DR
3AM improves video object segmentation by integrating 3D-aware MUSt3R features into SAM2. It achieves geometry-consistent recognition from RGB input alone, with no camera poses or preprocessing at inference, and outperforms state-of-the-art VOS methods by +15.9 IoU points.
Contribution
The paper introduces 3AM, a training-time enhancement for SAM2 that fuses 3D-aware MUSt3R features with SAM2's appearance features to make video segmentation consistent under viewpoint changes.
Findings
Achieves 90.6% IoU on the ScanNet++ dataset.
Outperforms state-of-the-art VOS methods by +15.9 IoU points.
Requires only RGB input at inference, no camera poses or preprocessing.
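For readers unfamiliar with the metric, IoU (intersection over union) scores a predicted mask against a ground-truth mask as the number of pixels both mark as object divided by the number either marks as object. The following is a minimal pure-Python sketch of that definition, not the paper's evaluation code. The function name `iou`, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions made for illustration.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1  # pixel is object in both masks
            if p or t:
                union += 1  # pixel is object in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 / 4 = 0.5
```

A reported figure such as 90.6% IoU is this ratio expressed as a percentage. How the score is aggregated over frames and objects is not stated in this summary.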
Abstract
Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at…
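The fusion idea in the abstract can be illustrated with a toy sketch: several levels of geometry features are merged into one vector per token, which is then combined with an appearance vector for the same token. This is not the paper's Feature Merger, which is a learned module whose layers are not described here. The weighted sum, the concatenation, and the names `merge_levels` and `fuse` are all hypothetical stand-ins chosen to keep the example self-contained.

```python
def merge_levels(levels, weights):
    """Weighted sum over feature levels.

    levels:  list of levels, each a list of per-token vectors of equal size.
    weights: one scalar per level (stand-in for learned parameters).
    """
    num_tokens = len(levels[0])
    dim = len(levels[0][0])
    merged = [[0.0] * dim for _ in range(num_tokens)]
    for level, w in zip(levels, weights):
        for i, vec in enumerate(level):
            for j, v in enumerate(vec):
                merged[i][j] += w * v
    return merged


def fuse(geometry, appearance):
    """Concatenate geometry and appearance vectors token by token."""
    return [g + a for g, a in zip(geometry, appearance)]


# Two feature levels, two tokens, two channels each (toy values).
levels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, 3.0]],
]
geometry = merge_levels(levels, weights=[0.5, 0.5])
# One appearance channel per token (toy values).
appearance = [[0.1], [0.9]]
fused = fuse(geometry, appearance)
```

Each fused token carries both kinds of evidence, which is the sense in which recognition can be grounded in spatial position as well as visual similarity.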
