CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

Inseok Jeon; Suhwan Cho; Minhyeok Lee; Seunghoon Lee; Minseok Kang; Jungho Lee; Chaewon Park; Donghyeong Kim; Sangyoun Lee

arXiv:2604.14630·cs.CV·April 17, 2026

TL;DR

This paper introduces a novel cross-modality token modulation approach that enhances the interaction between appearance and motion cues in unsupervised video object segmentation, achieving state-of-the-art results.

Contribution

The paper proposes dense connections between appearance and motion tokens, with relation transformer blocks propagating information both within and across the two modalities, and adds a token masking strategy to improve learning efficiency.

Findings

01

Achieves state-of-the-art performance on all public benchmarks.

02

Outperforms existing methods in unsupervised video object segmentation.

03

Enhances inter-modal interaction through dense token connections.

Abstract

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
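The abstract does not give architectural details, so the following is only a rough sketch of the general idea, not the authors' relation transformer block. It assumes that "dense connections" can be approximated by single-head self-attention over the concatenated appearance and motion tokens, and that "token masking" means randomly dropping a fraction of tokens before that step. The function names, the keep ratio, and the absence of learned projections are all illustrative assumptions.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mask_tokens(tokens, keep_ratio, rng):
    """Randomly keep a subset of tokens (hypothetical masking scheme)."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept]


def joint_attention(appearance, motion):
    """Single-head attention over the concatenated token set.

    Queries, keys and values are the tokens themselves (no learned
    projections), so this only illustrates the dense connectivity:
    every token attends to every token of both modalities.
    """
    tokens = appearance + motion
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        mixed = [sum(w * tok[c] for w, tok in zip(weights, tokens)) for c in range(dim)]
        # Residual connection, as in a standard transformer block.
        updated.append([x + m for x, m in zip(query, mixed)])
    # Split the updated tokens back into their modalities.
    return updated[:len(appearance)], updated[len(appearance):]


# Toy inputs: 6 tokens per modality, 4 channels each (assumed sizes).
rng = random.Random(0)
appearance_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
motion_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]

kept_app = mask_tokens(appearance_tokens, keep_ratio=0.5, rng=rng)
kept_mot = mask_tokens(motion_tokens, keep_ratio=0.5, rng=rng)
new_app, new_mot = joint_attention(kept_app, kept_mot)
```

With a keep ratio of 0.5, each modality contributes three tokens to the joint attention step, which also shrinks the quadratic attention cost. A real model would use learned query, key, and value projections and multiple heads, and the paper may structure the intra-modal and inter-modal paths differently.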
