CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
Inseok Jeon, Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Jungho Lee, Chaewon Park, Donghyeong Kim, Sangyoun Lee

TL;DR
This paper introduces a novel cross-modality token modulation approach that enhances the interaction between appearance and motion cues in unsupervised video object segmentation, achieving state-of-the-art results.
Contribution
It proposes dense connections between appearance and motion tokens, processed by relation transformer blocks that propagate information both within and across modalities, together with a token masking strategy that improves learning efficiency.
Findings
Reports state-of-the-art performance on all public benchmarks, outperforming existing unsupervised video object segmentation methods.
Enhances inter-modal interaction through dense token connections.
Abstract
Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
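The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that "dense connections" can be approximated by single-head self-attention over the concatenation of appearance and motion tokens, and that "token masking" means randomly keeping a subset of tokens during training. The names `mask_tokens`, `joint_attention`, and `keep_ratio`, and all shapes and weights, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mask_tokens(tokens, keep_ratio, rng):
    """Randomly keep a fraction of tokens (assumed masking scheme)."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * keep_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep_idx], keep_idx


def joint_attention(app_tokens, mot_tokens, w_q, w_k, w_v):
    """Single-head self-attention over tokens of both modalities.

    Concatenating the two token sets lets every token attend to every
    other token, so one attention map covers both intra-modal and
    inter-modal connections.
    """
    x = np.concatenate([app_tokens, mot_tokens], axis=0)  # (Na + Nm, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (N, N)
    out = x + attn @ v                                    # residual update
    na = app_tokens.shape[0]
    return out[:na], out[na:], attn


# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
d = 8
app = rng.normal(size=(16, d))  # appearance tokens (e.g. from RGB frames)
mot = rng.normal(size=(16, d))  # motion tokens (e.g. from optical flow)

app_kept, _ = mask_tokens(app, keep_ratio=0.5, rng=rng)
mot_kept, _ = mask_tokens(mot, keep_ratio=0.5, rng=rng)

w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
app_out, mot_out, attn = joint_attention(app_kept, mot_kept, w_q, w_k, w_v)
print(app_out.shape, mot_out.shape, attn.shape)
```

In this sketch the off-diagonal blocks of `attn` carry the inter-modal weights (appearance attending to motion and vice versa), while the diagonal blocks carry the intra-modal ones. Masking before attention also shrinks the attention map, which is one plausible way such a strategy could reduce training cost.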
