PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation
Dingwen Xiao, Weiming Zhang, Shiqi Wen, Lin Wang

TL;DR
PanoSAM2 is a lightweight adaptation of SAM2 designed for 360 video object segmentation, addressing distortion, memory, and semantic issues to improve temporal coherence and accuracy.
Contribution
It introduces novel distortion-aware decoding, a distortion-guided loss, and a long-short memory module for effective 360VOS with minimal additional complexity.
Findings
Achieves a +5.6 improvement over SAM2 on 360VOTS
Achieves a +6.7 improvement over SAM2 on PanoVOS
Demonstrates significant gains over SAM2 in 360VOS tasks
Abstract
360 video object segmentation (360VOS) aims to predict temporally consistent masks in 360 videos, which offer full-scene coverage and benefit applications such as VR/AR and embodied AI. Learning a 360VOS model is nontrivial due to the lack of high-quality labeled datasets. Recently, Segment Anything Models (SAMs), especially SAM2 with its memory module, have shown strong, promptable VOS capability. However, directly applying SAM2 to 360VOS yields implausible results, as 360 videos suffer from projection distortion, semantic inconsistency between the left and right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework built on lightweight distortion- and memory-aware adaptations of SAM2 that achieves reliable 360VOS while retaining SAM2's user-friendly prompting design. Concretely, to tackle the projection distortion…
