MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance
Yoonwoo Jeong, Cheng Sun, Yu-Chiang Frank Wang, Minsu Cho, Jaesung Choe

TL;DR
MV-SAM introduces a 3D-aware multi-view segmentation framework that uses pointmaps to achieve consistent results across views without explicit 3D training, outperforming existing methods.
Contribution
The paper presents MV-SAM, a novel multi-view segmentation approach that lifts 2D prompts into 3D space using pointmaps, eliminating the need for 3D data or networks.
Findings
Outperforms SAM2-Video on multiple benchmarks.
Achieves comparable results to per-scene optimization methods.
Generalizes well across various datasets.
Abstract
Promptable segmentation has emerged as a powerful paradigm in computer vision, enabling users to guide models in parsing complex scenes with prompts such as clicks, boxes, or textual cues. Recent advances, exemplified by the Segment Anything Model (SAM), have extended this paradigm to videos and multi-view images. However, the lack of 3D awareness often leads to inconsistent results, necessitating costly per-scene optimization to enforce 3D consistency. In this work, we introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps -- 3D points reconstructed from unposed images by recent visual geometry models. Leveraging the pixel-point one-to-one correspondence of pointmaps, MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data. Specifically, MV-SAM extends SAM by lifting image…
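The pixel-point correspondence the abstract relies on can be illustrated with a small sketch. This is not the authors' implementation: the pointmap is assumed to be a plain nested list holding one 3D point per pixel, the names `lift_click` and `nearest_pixel` are hypothetical, and the nearest-neighbour lookup is only a stand-in to show how a prompt lifted from one view can be located in another view once both share a 3D frame.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
# Assumed layout: pointmap[row][col] -> (x, y, z), one 3D point per pixel.
Pointmap = List[List[Point3D]]


def lift_click(pointmap: Pointmap, click: Tuple[int, int]) -> Point3D:
    """Lift a 2D click given as (col, row) to 3D by reading the pointmap there."""
    col, row = click
    return pointmap[row][col]


def nearest_pixel(pointmap: Pointmap, target: Point3D) -> Tuple[int, int]:
    """Return the (col, row) pixel whose 3D point lies closest to `target`."""
    best, best_d2 = (0, 0), float("inf")
    for row, line in enumerate(pointmap):
        for col, point in enumerate(line):
            d2 = sum((a - b) ** 2 for a, b in zip(point, target))
            if d2 < best_d2:
                best, best_d2 = (col, row), d2
    return best


# Toy scene (made up for illustration): two 2x3 views of the plane z = 1,
# where view B sees the same surface shifted by one pixel column.
view_a = [[(float(c), float(r), 1.0) for c in range(3)] for r in range(2)]
view_b = [[(float(c - 1), float(r), 1.0) for c in range(3)] for r in range(2)]

prompt_3d = lift_click(view_a, (0, 1))         # click at col 0, row 1 in view A
match_in_b = nearest_pixel(view_b, prompt_3d)  # same surface point seen in view B
```

Because both views index into one shared 3D frame, the click needs no camera poses to be transferred; any error in the reconstructed pointmap would carry over directly to the lookup.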
Peer Reviews
Decision: Submitted to ICLR 2026
Overall, the paper is clear and descriptive about the different components of the proposed method. The problem statement is well-scoped and the method section includes all details and motivations behind the design choices. The core contribution of the method is enabling 3D awareness without 3D supervision by using the pointmap representation. It requires no per-scene optimization, which allows it to generalize to various datasets. Results show considerable improvement over SAM2-Video and being comp
- The method relies on a pretrained visual geometry model, making it dependent on the accuracy of the pointmap reconstruction. The error in the pointmap reconstruction can propagate to the final mask prediction.
- In Figure 4, I suggest reducing the opacity of the truck in the reference image to make it more visible and easier to interpret.
- The discussion on the limitations is not included in the main text. I recommend that the authors include a discussion with some examples where the method f
The technical idea is well-motivated and easy to follow, building off existing work. Enhancing SAM2-Video embeddings with 3D positional information improves consistency across views. Results show significant improvement over SAM2-Video across NVOS and SPIn-NeRF benchmarks. They achieve competitive performance with optimization-based methods without requiring per-scene fitting. The ablations are informative, particularly Table 3a which evaluates key decoder design choices. They systematically c
The comparison against generalization baselines is limited to SAM2-Video alone. It would strengthen the paper to include other video or multi-view segmentation methods that don't require per-scene optimization. The method relies heavily on Pi3 for pointmap generation, but the technical sections provide limited detail on how Pi3 works. Given that Pi3 appears to do much of the heavy lifting, it's unclear how much of the contribution is genuinely novel versus simply combining existing components (
1. The paper presents an original and well-motivated idea. By leveraging the power of 3D reconstruction methods and obtaining a 3D point coordinate for each pixel, it is possible to make the segmentation model aware of the 3D structure of the scene, enabling consistency between the segmentation masks predicted for different images of the same scene.
2. The effectiveness of the proposed method that leverages this idea, MV-SAM, is properly demonstrated through experiments. Across various datasets
1. This is not a major weakness, but it is not clear how efficient the proposed MV-SAM method is compared to the existing SAM2 method. I can imagine that running $\pi^3$ for each scene introduces a significant computational overhead. The paper would be stronger if it provided insights into the runtime, number of parameters, and number of FLOPs for both MV-SAM and SAM2. MV-SAM would still be valuable if it were less efficient than SAM2, but information about their relative efficiency would p
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition
