Completing Visual Objects via Bridging Generation and Segmentation
Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu

TL;DR
This paper introduces MaskComp, a novel iterative method that combines generation and segmentation to progressively refine partial object masks, achieving superior object completion results compared to existing approaches.
Contribution
MaskComp uniquely integrates generation and segmentation stages in an iterative process to improve object completion accuracy.
Findings
MaskComp outperforms ControlNet and Stable Diffusion in object completion tasks.
Iterative refinement of masks leads to more accurate object shape reconstruction.
The combined generation-segmentation approach acts as an effective mask denoiser.
Abstract
This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components. Our method, named MaskComp, delineates the completion process through iterative stages of generation and segmentation. In each iteration, the object mask is provided as an additional condition to boost image generation, and, in return, the generated images can lead to a more accurate mask by fusing the segmentation of images. We demonstrate that the combination of one generation and one segmentation stage effectively functions as a mask denoiser. Through alternation between the generation and segmentation stages, the partial object mask is progressively refined, providing precise shape guidance and yielding superior object completion results. Our experiments demonstrate the superiority of MaskComp over existing approaches, e.g.,…
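The alternation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate` and `segment` are hypothetical callables standing in for the mask-conditioned generator and the segmentation model, masks are plain nested lists, and the averaging-plus-threshold fusion rule, the `threshold` value, and the `num_iters` and `num_samples` defaults are all assumptions chosen for the example.

```python
from typing import Callable, List, Tuple

Mask = List[List[int]]      # binary mask as rows of 0/1 (illustrative representation)
Image = List[List[float]]   # placeholder image type for this sketch


def vote_masks(masks: List[Mask], threshold: float = 0.5) -> Mask:
    """Fuse several masks by per-pixel averaging followed by thresholding.

    The threshold is an assumption made for this sketch.
    """
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [1 if sum(m[r][c] for m in masks) / n >= threshold else 0
         for c in range(width)]
        for r in range(height)
    ]


def complete_object(
    partial_image: Image,
    partial_mask: Mask,
    generate: Callable[[Image, Mask, int], Image],  # hypothetical generator
    segment: Callable[[Image], Mask],               # hypothetical segmenter
    num_iters: int = 3,
    num_samples: int = 4,
) -> Tuple[List[Image], Mask]:
    """Alternate generation and segmentation stages to refine the mask."""
    mask = partial_mask
    images: List[Image] = []
    for _ in range(num_iters):
        # Generation stage: sample several images conditioned on the current mask.
        images = [generate(partial_image, mask, seed) for seed in range(num_samples)]
        # Segmentation stage: segment each sample and fuse the results.
        mask = vote_masks([segment(img) for img in images])
    return images, mask
```

In this sketch, one pass through the loop body corresponds to one generation stage followed by one segmentation stage, the pair the abstract describes as functioning as a mask denoiser; repeating it refines the mask progressively.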
Peer Reviews
Decision·ICML 2024 Poster
A clever idea that effectively improves occluded object completion. The authors actively analyze recent progress in image generation models and identify a useful problem. In the quantitative evaluation, the FID metric shows a significant improvement over other methods, and the numbers are supported by the qualitative results. By averaging the results of multiple runs, the method largely overcomes the random, unstable results that image generation models frequently produce.
While this paper has attractive strengths, it is rather application-oriented research that exploits good features of prior work. Considering the overall direction of the papers presented at this conference (ICLR), readers may expect a more theoretical or fundamental idea that can be transferred to, or stimulate, other research. This paper is heavily dependent on the Zhang 2023 paper.
1. The proposed method (MaskComp) introduces a novel interactive approach to complete an object by generating object masks as guidance.
2. Section 3.3 tries to give some theoretical analysis of MaskComp, which is interesting and helps to understand the benefit of introducing masks in the generation approach.
3. From the visual results, I find MaskComp completes the input partial object and achieves higher perceptual quality compared to other methods. Quantitatively, it also achieves higher FID
1. The technical contributions of the proposed method could be further improved. For now, the proposed method is mainly a mask-guided stable diffusion model with an off-the-shelf segmentation model to produce the mask condition. Using SAM to generate masks is straightforward and the mask voting process gives no surprises.
2. If generating a mask is the key to generating high-quality images, why not directly use an encoder-decoder model like U-Net or an SD to predict the target complete mask in
- The proposed joint object image and mask completion strategy is well-motivated and the overall method seems novel for the target task.
- The paper is mostly easy to follow.
- The experiments include both automatic and human-based metrics for evaluation, and the results are better than baselines.
- The justification for the entire iterative procedure is lacking. While it is ideal to achieve improvements as shown in Figure 5, there is no guarantee that such improvement can be realized in a realistic setting. In particular, the mask-denoising ControlNet is trained in a local manner, which may generate a worse image, and the segmentation stage is largely dependent on the segmentation model S, which may produce noisy segmentation output. Therefore, it is unclear whether this design would wor
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Neural Network Applications · Image Processing Techniques and Applications
Methods: Diffusion
