Towards Natural Image Matting in the Wild via Real-Scenario Prior
Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Qianru Sun, Yang Tang, Bo Li, Pan Zhou

TL;DR
This paper introduces COCO-Matting, a large real-world dataset for natural image matting, and proposes SEMat, a novel architecture that leverages pre-trained SAM features for improved matting performance in complex scenes.
Contribution
The paper presents a new real-world dataset COCO-Matting and a transformer-based network SEMat that enhances feature extraction and training objectives for natural image matting.
Findings
COCO-Matting contains 38,251 human instance-level alpha mattes in complex natural scenes.
SEMat outperforms existing methods across seven diverse datasets.
The approach effectively generalizes to complex and occlusion scenes.
Abstract
Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting and fine-tune the models based on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex and occlusion scenes. We address this challenge by proposing a new matting dataset based on the COCO dataset, namely COCO-Matting. Specifically, the construction of our COCO-Matting includes accessory fusion and mask-to-matte, which selects real-world complex images from COCO and converts semantic segmentation masks to matting labels. The built COCO-Matting comprises an extensive collection of 38,251 human instance-level alpha mattes in complex natural scenarios. Furthermore, existing SAM-based matting methods extract intermediate features and masks from a frozen SAM and only train a lightweight matting decoder by end-to-end matting losses,…
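The mask-to-matte step mentioned in the abstract starts from a segmentation mask, and a reviewer quote further down indicates that a trimap is passed to a trained trimap-based network. As a rough illustration of the general idea only, the sketch below derives a trimap from a binary mask by eroding and dilating it with a square window. This is a common generic recipe and not the paper's actual procedure; the function names, the label values (255/128/0), and the `radius` parameter are all assumptions made for this example.

```python
# Illustrative sketch only: a common way to derive a trimap from a binary mask.
# NOT the paper's mask-to-matte procedure; names, label values, and the band
# width (radius) are assumptions made for this example.

FG, UNKNOWN, BG = 255, 128, 0


def _window_values(mask, row, col, radius):
    """Yield mask values in the square window around (row, col), clipped to the image."""
    height, width = len(mask), len(mask[0])
    for r in range(max(0, row - radius), min(height, row + radius + 1)):
        for c in range(max(0, col - radius), min(width, col + radius + 1)):
            yield mask[r][c]


def mask_to_trimap(mask, radius=1):
    """Convert a binary mask (nested lists of 0/1) into a trimap.

    A pixel is foreground if every value in its window is 1 (erosion),
    background if every value in its window is 0 (outside the dilation),
    and unknown otherwise (the band around the mask boundary).
    """
    trimap = []
    for row in range(len(mask)):
        out_row = []
        for col in range(len(mask[0])):
            values = list(_window_values(mask, row, col, radius))
            if all(values):
                out_row.append(FG)
            elif not any(values):
                out_row.append(BG)
            else:
                out_row.append(UNKNOWN)
        trimap.append(out_row)
    return trimap


if __name__ == "__main__":
    # 9x9 toy mask with a 3x3 foreground square in the middle.
    toy = [[1 if 3 <= r <= 5 and 3 <= c <= 5 else 0 for c in range(9)] for r in range(9)]
    for line in mask_to_trimap(toy, radius=1):
        print(" ".join(f"{v:3d}" for v in line))
```

A larger `radius` widens the unknown band, which leaves more pixels for a matting network to resolve; in practice the band width would need to be tuned to the image resolution and to how coarse the masks are.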
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
+ Good motivation to construct a large-scale real-world human image matting dataset featuring complex scenes, annotated with human instance-level alpha mattes.
+ Feasible attempts for designing a matting-adapted SAM-based framework, with aligned network architecture and training losses. The framework seems effective.
+ Extensive experiments on seven datasets.
+ The paper is easy to follow.
i) Dataset - Inaccurate annotations. The motivation to create the COCO-Matting is good, considering the natural and complex scenes, but I have two major concerns about the generation procedures of the annotations. First, I wonder whether the ‘Accessory Fusion’ step would still lead to cases where objects that do not correspond to the target instance are retained. For example, will the cup be retained if another human instance holds a cup that blocks part of the current instance's body? Also, in
The authors propose a new large-scale dataset for matting, but I have several concerns about it. Please see the weaknesses.
1. The major contribution of this paper is the dataset. However, I have a very big concern about this part. The authors generate the final 'high-quality' alpha matte using an off-the-shelf method: "Finally, we forward the following trimap T(x) to the trained trimap-based network (Hu et al., 2023) and treat it as target." My question is: since the GT is generated via another trimap-based matting method, that is the highest quality you could get. Then, what's the purpose of your
+ The authors propose COCO-Matting, which is a valuable dataset for interactive matting. The results show that with COCO-Matting, which contains lots of natural images instead of synthetic images.
+ The paper is well organized and presented, and the figures are also clear.
+ The performance of SEMat is impressive, topping all the mentioned datasets.
Despite the strong performance and valuable datasets, I still have some concerns.
1. Almost all current matting datasets are manually labeled, but the authors just use trimap-based matting models to label theirs. So whether the labels can be called ground truth is open to question. Was any data selection performed to verify the correctness of the dataset?
2. From the results, updating the SAM version brings an improvement, so I wonder what the result would be if MatAny were equipped with SAM2. A
1. The paper proposes a pipeline to construct a new matting dataset based on an image segmentation dataset. In this way, the matting model can be trained on a large, real dataset.
2. I think predicting the trimap from SAM is a highlight, because a precise trimap is necessary for the matting decoder.
1. The novelty is insufficient. Even though some tricks are used to construct the dataset, the matting model does not contain enough novel methods.
2. The introduction on matting datasets misses some existing datasets (training and test sets), for instance RefMatte, ICM-57, and so on.
3. The ablation on the trimap loss is not clear. If the loss is not used, what does the matting decoder use in place of the trimap? In fact, I am concerned about the contribution of predicting the trimap from SAM.
Taxonomy
Topics: Image Enhancement Techniques · Microwave Imaging and Scattering Analysis
Methods: Segment Anything Model
