Vision-Language Model Guided Image Restoration
Cuixin Yang, Rongkang Dong, Kin-Man Lam

TL;DR
This paper introduces VLMIR, a novel image restoration framework that leverages vision-language models like CLIP to incorporate semantic priors, resulting in improved visual perception and semantic coherence in restored images.
Contribution
The paper proposes a two-stage VLMIR framework that integrates vision-language priors into diffusion-based image restoration, enhancing both pixel fidelity and semantic understanding.
Findings
VLMIR outperforms existing methods on multiple IR benchmarks.
Incorporating linguistic priors improves semantic coherence in restored images.
Ablation studies confirm the effectiveness of VLM-based feature extraction.
Abstract
Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Multimodal Machine Learning Applications
