TL;DR
This paper introduces DA-CLIP, a multi-task framework that adapts a pretrained vision-language model (CLIP) to image restoration across multiple degradation types, and reports state-of-the-art restoration results.
Contribution
The paper proposes a degradation-aware adaptation of CLIP with a controller for multi-task image restoration, enabling better transfer from large-scale pretrained models to low-level vision tasks.
Findings
DA-CLIP achieves state-of-the-art performance on image restoration tasks.
The model effectively handles multiple degradation types with a degradation classifier.
The authors construct a mixed-degradation dataset with synthetic captions for training.
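The degradation classifier mentioned in the findings can be pictured as CLIP-style matching between an image-side degradation embedding and one text embedding per degradation name. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the toy 3-dimensional embeddings, and the temperature value are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_degradation(image_emb, text_embs, temperature=0.01):
    """Match a degradation embedding against one text embedding per
    degradation name; return the best label and all probabilities.
    The temperature is an assumed value, not taken from the paper."""
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[name]) / temperature for name in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))

# Toy embeddings, made up for illustration (not real CLIP outputs).
TEXT_EMBS = {
    "noisy": [1.0, 0.0, 0.0],
    "rainy": [0.0, 1.0, 0.0],
    "hazy": [0.0, 0.0, 1.0],
}
label, probs = classify_degradation([0.1, 0.9, 0.2], TEXT_EMBS)
print(label, probs)
```

In a real pipeline the two sets of embeddings would come from the image-side controller and the text encoder respectively; here they are hard-coded so the matching step can be read in isolation.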
Abstract
Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration, their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a…
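The cross-attention step described in the abstract can be sketched as follows. This is a simplified, single-head illustration under stated assumptions, not the paper's architecture: the restoration features act as queries, a small list of context embeddings acts as keys and values, and the projection matrices `Wq`, `Wk`, `Wv`, the residual connection, and the toy 2-dimensional inputs are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cross_attention(features, context, Wq, Wk, Wv):
    """Residual cross-attention: each restoration feature (query) attends
    to the context embeddings (keys/values) and adds the attended result.
    Assumes the value dimension equals the feature dimension."""
    keys = [matvec(Wk, c) for c in context]
    values = [matvec(Wv, c) for c in context]
    scale = 1.0 / math.sqrt(len(keys[0]))
    fused = []
    for f in features:
        q = matvec(Wq, f)
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        fused.append([fi + ai for fi, ai in zip(f, attended)])
    return fused

# Toy inputs: identity projections and 2-d vectors, for illustration only.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for restoration-network tokens
context = [[2.0, 0.0], [0.0, 2.0]]    # stand-in for controller embeddings
fused = cross_attention(features, context, IDENTITY, IDENTITY, IDENTITY)
print(fused)
```

Each feature token is pulled toward the context embedding it is most similar to, which is the mechanism by which a conditioning embedding can steer a restoration network. A learned, multi-head version would replace the identity matrices with trained projections.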
Peer Reviews
Decision: ICLR 2024 poster
- This paper proposes a novel framework, DA-CLIP, which combines large-scale pretrained vision-language models with image restoration networks.
- This paper introduces an Image Controller that addresses the feature mismatching issue between corrupted inputs and clean captions in existing vision-language models. In addition, they introduce a prompt learning module to better utilize the degradation context for unified image restoration.
- It demonstrates that DA-CLIP in both degradation-specific a
- In Figure 1, DA-CLIP achieves surprisingly high accuracy in ten degradation types. How are these experiments set up? In contrast, CLIP performs poorly in many types. What prompts do the authors use for classifying degradations in CLIP?
- In Figure 6, PromptIR is comparable or even better than the proposed DA-CLIP in most tasks on fidelity metrics.
- In Table 2(c), the PSNR of DA-CLIP highly deviates from that of MAXIM. In addition, the results on task-specific restoration do not show a clear b
I hold a positive view on the idea of incorporating degradation information into CLIP.
My main concern with this paper is its task setting. First, "Universal Image Restoration" is a term that is not so easily justified. This paper simply brings together ten different image restoration tasks, which is closer to "multi-task" than the so-called "universal". For a large model, mixing these ten tasks in such a separate manner for training, the model would internally categorize the problems before handling them in a single-task manner [R1]. This would not endow the model with sufficient
1. The idea of constructing a vision-language model to restore clean semantic image representation and distinct degradation types of low-quality images is interesting.
2. The method of using a clean image representation and a degradation prompt to instruct restoration networks for better performance is sound.
3. The results look good, and the experimental analysis demonstrates the effectiveness of DA-CLIP on all-in-one image restoration.
4. The writing is good, and the paper is easy to read.
1. It is questionable that the caption embedding can provide a high-quality image representation supervision for the content embedding. $e_c^T$ can indeed provide a semantic supervision, but there is no guarantee that it is a clean image representation. Therefore, I think the claim that the image encoder with the controller outputs high-quality content features is not rigorous. It seems that $e_c^I$ mainly serves to provide semantic instruction for the restoration network, especially for diffusi
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Advanced Image Processing Techniques
Methods: Contrastive Language-Image Pre-training · Diffusion
