CineMatte: Background Matting for Virtual Production and Beyond
Yuanjian He, Chen Zhang, Fasheng Chen, Jiangbo Cao

TL;DR
CineMatte introduces a robust background matting framework for virtual production that employs a Siamese Vision Transformer with cross-attention, improving boundary detail recovery and generalization to real-world footage.
Contribution
The paper presents CineMatte, a novel background matting method built on a cross-attention-conditioned ViT, and introduces CineMatte-4K, a new high-resolution VP matting dataset.
Findings
CineMatte outperforms existing models on VP and real-world benchmarks.
The new dataset enables training and evaluation of VP matting in real-world conditions.
Replacing the detail branch with a pretrained feature upsampler reduces boundary artifacts.
Abstract
LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional "detail branch" to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace…
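The Siamese encoding and cross-attention described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation. The `encode` function is a hypothetical stand-in for the frozen DINOv3 backbone (a single fixed linear map), and the token values are made up. Using frame tokens as queries and background tokens as keys and values is an assumption, as is the omission of learned projections and multiple heads. The only points it aims to show are that one set of frozen weights encodes both streams, and that each frame token is then compared against every background token.

```python
import math

def encode(tokens, weight):
    """Toy shared encoder: one fixed linear map per token.
    Hypothetical stand-in for a frozen ViT backbone."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight]
            for tok in tokens]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.
    Each query token attends over all key tokens and returns
    a weighted average of the value tokens."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out

# Frozen, shared weights: the same matrix encodes both streams (Siamese).
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, chosen only for illustration

# Made-up patch tokens for the input frame and the captured background.
frame_tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
background_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

frame_feats = encode(frame_tokens, W)
bg_feats = encode(background_tokens, W)

# Assumption: the frame stream queries the background stream.
attended = cross_attention(frame_feats, bg_feats, bg_feats)
```

Because the encoder is never updated and is applied identically to both inputs, any difference between `frame_feats` and `bg_feats` comes from the images themselves rather than from the encoder. A real model would feed the attended features into a learned head to predict the matte; that part is left out here.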
