VideoMaMa: Mask-Guided Video Matting via Generative Prior
Sangbeom Lim, Seoung Wug Oh, Jiahui Huang, Heeji Yoon, Seungryong Kim, Joon-Young Lee

TL;DR
VideoMaMa uses pretrained video diffusion models to convert coarse segmentation masks into accurate alpha mattes for video. Trained only on synthetic data, it generalizes zero-shot to real-world footage and enables large-scale pseudo-labeled dataset creation for real-world video matting.
Contribution
The paper presents VideoMaMa, a method that uses generative priors for mask-guided video matting, and introduces the MA-V dataset for large-scale training and evaluation.
Findings
VideoMaMa achieves strong zero-shot generalization to real-world videos despite being trained solely on synthetic data.
The MA-V dataset provides pseudo-labeled matting annotations for more than 50K real-world videos spanning diverse scenes and motions.
Fine-tuning SAM2 on MA-V (yielding SAM2-Matte) improves robustness on in-the-wild videos.
Abstract
Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting…
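To make the mask-versus-matte distinction in the abstract concrete, here is a minimal sketch of the standard compositing equation I = αF + (1 − α)B applied once with a binary mask and once with a fractional alpha matte. This is not the paper's method or code; the `composite` helper and the tiny 1×4 frame are hypothetical and exist only for illustration.

```python
import numpy as np

def composite(alpha, foreground, background):
    """Blend per pixel with the standard matting equation I = alpha*F + (1-alpha)*B."""
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return a * foreground + (1.0 - a) * background

# Hypothetical 1x4 RGB frame: white foreground over a black background.
fg = np.ones((1, 4, 3))
bg = np.zeros((1, 4, 3))

# A coarse segmentation mask is binary, so the object boundary is a hard step.
coarse_mask = np.array([[1.0, 1.0, 0.0, 0.0]])
# An alpha matte stores fractional coverage, so boundary pixels blend smoothly.
alpha_matte = np.array([[1.0, 0.75, 0.25, 0.0]])

hard = composite(coarse_mask, fg, bg)
soft = composite(alpha_matte, fg, bg)
```

With the binary mask, every pixel is fully foreground or fully background; with the matte, the two boundary pixels mix the layers, which is the kind of detail (hair, motion blur, semi-transparency) a mask-to-matte model is meant to recover.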
Taxonomy
Topics: Image Enhancement Techniques · Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment
