MaGIC: Multi-modality Guided Image Completion
Yongsheng Yu, Hao Wang, Tiejian Luo, Heng Fan, Libo Zhang

TL;DR
MaGIC introduces a flexible multi-modal guided image completion method that effectively combines various guidance modalities without retraining, outperforming existing approaches in plausibility and adaptability.
Contribution
The paper presents MaGIC, a novel multi-modal guided image completion framework supporting arbitrary modality combinations with a training-free blending method, enhancing scalability and performance.
Findings
Outperforms state-of-the-art methods in image completion tasks.
Supports arbitrary combinations of guidance modalities.
Demonstrates strong generalization across diverse completion scenarios.
Abstract
Vanilla image completion approaches are sensitive to large missing regions, because little reference information is available for plausible generation. To mitigate this, existing methods incorporate an extra cue as guidance for image completion. Despite improvements, these approaches are often restricted to a single modality (e.g., segmentation or sketch maps), which limits their scalability in leveraging multiple modalities for more plausible completion. In this paper, we propose a novel, simple yet effective method for Multi-modal Guided Image Completion, dubbed MaGIC, which not only supports a wide range of single modalities as guidance (e.g., text, canny edge, sketch, segmentation, depth, and pose), but also adapts to arbitrarily customized combinations of these modalities (i.e., arbitrary multi-modality) for image completion. For building MaGIC, we first…
Peer Reviews
Decision: ICLR 2024 poster
**Innovative and Flexible Approach** The paper addresses the challenging problem of multi-modality-guided image completion. It proposes a new, simple, training-free procedure that allows for various guidance modalities, such as text, edge, sketch, segmentation, depth, and pose. **Large Consistent Gains** The paper shows consistent and significant improvements over state-of-the-art approaches, particularly in image quality.
**Clarity and Typos** The paper is challenging to follow and contains multiple typos, which can impede understanding. Improved clarity in the presentation and thorough proofreading would enhance the paper's quality. **Non-standard Update Scheme** The update scheme presented in equation (5) appears inhomogeneous, as it involves gradient descent with respect to $z_t$ but updates $z'_{t-1}$. This choice could be a reasonable heuristic but is not discussed or justified, which leaves questions about
I like the extension of classifier guidance to multiple modalities, and training-free at that. Similar techniques have been explored in other single-modality contexts, as in Sketch-Guided Text-to-Image Diffusion Models, but the multi-modal case is a nice extension. The qualitative comparisons are very intuitive (especially with T2I-Adapter and ControlNet). The overall presentation is reasonable and easy to follow. The authors included substantial appendix sections, detailing several archit
While this is an interesting piece of work, I have some big gripes (please let me know if I understood it wrong): In the related work section (page 3, last paragraph), the authors give the impression that T2I-Adapter (and ControlNet) "fails to simultaneously use multi-modality as guidance". This is completely wrong. I understand that T2I-Adapter does not explicitly train jointly for multiple modalities, but it can combine multiple modalities (see section 4.3.2 in T2I-Adapter). Second, the authors n
1. Dealing with LARGE missing regions is a critical task in image completion. This topic is of broad interest in the ML and image processing community. 2. The idea of leveraging multiple resources is nice, though not groundbreakingly novel. Making it scalable and flexible is the key, which is solved by a two-stage approach: a modality-oriented conditional network and cross-modality blending. 3. The approach is integrated into the diffusion process neatly and is training-free. 4. The paper is very well w
1. I'm not fully convinced that different image channels/features, such as depth, sketch, and edge, can be called modalities. 2. A fair comparison is not easy, since most SOTA methods do not consider multiple resources at the same time. It would be nice to share some insight into this, and to share failure cases.
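The update pattern questioned in the review of equation (5) above (a gradient taken with respect to $z_t$, but a correction applied to $z'_{t-1}$) can be illustrated with a toy sketch. This is not the paper's equation (5), which is not reproduced here. The function `guided_step`, the quadratic per-modality losses, the stand-in `denoise_step`, and all weights and step sizes are hypothetical, chosen only to show how gradients from several guidance signals might be summed and applied at sampling time without any retraining.

```python
import numpy as np

def guided_step(z_t, denoise_step, loss_grads, weights, step_size):
    """One hypothetical guided sampling step.

    Gradients of each guidance loss are evaluated at z_t, summed with
    per-modality weights, and subtracted from the unguided z_{t-1}.
    """
    z_prev = denoise_step(z_t)            # unguided z_{t-1} (stand-in sampler)
    grad = np.zeros_like(z_t)
    for loss_grad, w in zip(loss_grads, weights):
        grad += w * loss_grad(z_t)        # gradient taken at z_t, not at z_prev
    return z_prev - step_size * grad      # correction applied to z_{t-1}

# Toy setup: each "modality" pulls the latent toward its own target through
# the loss 0.5 * ||z - target||^2, whose gradient is (z - target).
targets = [np.zeros(4), 2.0 * np.ones(4)]
loss_grads = [lambda z, t=t: z - t for t in targets]
denoise_step = lambda z: 0.9 * z          # placeholder for a real denoiser

z_t = np.ones(4)
# Only the first modality is active here (second weight is 0).
z_next = guided_step(z_t, denoise_step, loss_grads, [1.0, 0.0], step_size=0.1)
```

Adding or dropping a modality in this sketch only changes the `loss_grads` and `weights` lists, which is the sense in which such a blending scheme needs no joint training. Whether the mismatch between where the gradient is evaluated and where it is applied is justified remains the reviewer's open question.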
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Advanced Neural Network Applications
Methods: Concatenated Skip Connection · Convolution · Max Pooling · U-Net
