TL;DR
VGGT-Segmentor is a novel framework that combines geometric modeling with pixel-accurate segmentation for cross-view object segmentation, achieving state-of-the-art results without paired annotations.
Contribution
It introduces a new segmentation head, the Union Segmentation Head, and a self-supervised training strategy, improving dense prediction accuracy in cross-view scenarios.
Findings
Achieves 67.7% and 68.0% average IoU on the Ego-Exo4D benchmark.
Outperforms prior methods in cross-view segmentation tasks.
Pretrained model surpasses many fully-supervised baselines.
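For readers unfamiliar with the metric, the average IoU reported above can be illustrated with a minimal sketch. It assumes binary masks stored as nested lists and a plain per-pair average; the function names `mask_iou` and `mean_iou` are hypothetical, and the benchmark's exact averaging protocol is not stated in this summary.

```python
from typing import Sequence

# A 2D binary mask: 1 marks an object pixel, 0 marks background.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Plain average of per-pair IoU scores (an assumed averaging scheme)."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred_a = [[1, 1], [0, 0]]
    gt_a = [[1, 0], [0, 0]]
    pred_b = [[0, 1], [0, 1]]
    gt_b = [[0, 1], [0, 1]]
    print(mask_iou(pred_a, gt_a))
    print(mean_iou([pred_a, pred_b], [gt_a, gt_b]))
```

A score of 1.0 means the predicted mask matches the ground truth exactly, while 0.0 means the two masks do not overlap at all.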
Abstract
Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in…
