Controlling Vision-Language Models for Multi-Task Image Restoration

Ziwei Luo; Fredrik K. Gustafsson; Zheng Zhao; Jens Sjölund; Thomas B. Schön

arXiv:2310.01018 · cs.CV · February 29, 2024 · 21 cites

TL;DR

This paper introduces DA-CLIP, a multi-task framework that adapts pretrained vision-language models for high-fidelity image restoration across multiple degradation types and reports improvements over prior state-of-the-art results.

Contribution

The paper proposes a degradation-aware adaptation of CLIP with a controller for multi-task image restoration, enabling better transfer from large-scale pretrained models to low-level vision tasks.

Findings

01. DA-CLIP achieves state-of-the-art performance on image restoration tasks.

02. The model handles multiple degradation types with the help of a degradation classifier.

03. The authors construct a mixed-degradation dataset with synthetic captions for training.
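
The degradation classifier mentioned in the findings can be pictured as CLIP-style zero-shot matching: an image-side degradation embedding is compared with one text embedding per degradation name, and the closest name wins. The sketch below is only an illustration under that assumption. The three labels, the 3-dimensional toy vectors, and the function names are hypothetical and do not come from the paper or its repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_degradation(degradation_emb, text_embs):
    """Return the label whose text embedding is most similar, plus all scores.

    degradation_emb: image-side degradation embedding (list of floats)
    text_embs: dict mapping a degradation name to its text embedding
    """
    scores = {name: cosine(degradation_emb, emb) for name, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings; real CLIP embeddings have hundreds of dimensions.
text_embs = {
    "blurry": [1.0, 0.0, 0.0],
    "noisy": [0.0, 1.0, 0.0],
    "rainy": [0.0, 0.0, 1.0],
}
degradation_emb = [0.1, 0.9, 0.2]

label, scores = classify_degradation(degradation_emb, text_embs)
print(label)  # prints "noisy"
```

In a real pipeline the toy vectors would be replaced by outputs of the image-side controller and the text encoder. How the paper sets up its classification experiments is a question Reviewer 01 raises and is not answered on this page.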

Abstract

Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration, their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a…
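
The cross-attention step described in the abstract can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the tensor sizes, the random projection matrices, the residual connection, and the use of several context tokens are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(features, context, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection.

    features: (n, d_feat) tokens from the restoration network (queries)
    context:  (m, d_ctx) embedding tokens from the encoder (keys and values)
    w_q: (d_feat, d_attn), w_k: (d_ctx, d_attn), w_v: (d_ctx, d_feat)
    """
    q = features @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n, m) scaled dot products
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return features + weights @ v, weights


# Hypothetical sizes: a 4x4 feature map flattened to 16 tokens, 4 context tokens.
rng = np.random.default_rng(0)
n, m, d_feat, d_ctx, d_attn = 16, 4, 8, 32, 8
features = rng.normal(size=(n, d_feat))
context = rng.normal(size=(m, d_ctx))
w_q = rng.normal(size=(d_feat, d_attn)) * 0.1
w_k = rng.normal(size=(d_ctx, d_attn)) * 0.1
w_v = rng.normal(size=(d_ctx, d_feat)) * 0.1

out, weights = cross_attention(features, context, w_q, w_k, w_v)
print(out.shape, weights.shape)  # prints "(16, 8) (16, 4)"
```

Note that if the context is a single embedding vector (m = 1), the softmax weights are all 1 and the layer reduces to adding one projected vector to every feature token.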

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

- This paper proposes a novel framework, DA-CLIP, which combines large-scale pretrained vision-language models with image restoration networks.
- This paper introduces an Image Controller that addresses the feature mismatching issue between corrupted inputs and clean captions in existing vision-language models. In addition, they introduce a prompt learning module to better utilize the degradation context for unified image restoration.
- It demonstrates that DA-CLIP in both degradation-specific a

Weaknesses

- In Figure 1, DA-CLIP achieves surprisingly high accuracy in ten degradation types. How are these experiments set up? In contrast, CLIP performs poorly in many types. What prompts do the authors use for classifying degradations in CLIP?
- In Figure 6, PromptIR is comparable or even better than the proposed DA-CLIP in most tasks on fidelity metrics.
- In Table 2(c), the PSNR of DA-CLIP highly deviates from that of MAXIM. In addition, the results on task-specific restoration do not show a clear b

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

I hold a positive view on the idea of incorporating degradation information into CLIP.

Weaknesses

My main concern with this paper is its task setting. First, "Universal Image Restoration" is a term that is not so easily justified. This paper simply brings together ten different image restoration tasks, which is closer to "multi-task" than the so-called "universal". For a large model, mixing these ten tasks in such a separate manner for training, the model would internally categorize the problems before handling them in a single-task manner [R1]. This would not endow the model with sufficient

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

1. The idea of constructing a vision-language model to restore clean semantic image representation and distinct degradation types of low-quality images is interesting.
2. The method of using clean image representative and degradation prompt to instruct restoration networks for better performance is sound.
3. The results look good, and the experimental analysis demonstrates the effectiveness of DA-CLIP on all-in-one image restoration.
4. The writing is good, and the paper is easy to read.

Weaknesses

1. It is questionable that the caption embedding can provide a high-quality image representation supervision for the content embedding. $e_c^T$ can indeed provide a semantic supervision, but there is no guarantee that it is a clean image representation. Therefore, I think the claim that the image encoder with the controller outputs high-quality content features is not rigorous. It seems that $e_c^I$ mainly serves to provide semantic instruction for the restoration network, especially for diffusi

Code & Models

Repositories

algolzw/daclip-uir · PyTorch · Official

Models

🤗 weblzw/daclip-uir-ViT-B-32-irsde · model · ♡ 4

Videos

Controlling Vision-Language Models for Multi-Task Image Restoration · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Advanced Image Processing Techniques

Methods: Contrastive Language-Image Pre-training · Diffusion