Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models
Cansu Korkmaz, Ahmet Murat Tekalp, Zafer Dogan

TL;DR
This paper presents a novel framework that uses vision-language models to select the most trustworthy super-resolution samples generated by diffusion models, improving reliability and semantic correctness.
Contribution
It introduces a VLM-based selection method and a new Trustworthiness Score (TWS) to evaluate and choose high-quality SR outputs from diffusion models.
Findings
VLM-guided selection yields higher TWS values.
TWS correlates strongly with human preferences.
The approach outperforms traditional metrics like PSNR and LPIPS.
Abstract
Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and…
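The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `VLMAssessment` fields, the `assess` callback (a stand-in for whatever VLM query is actually used), and the equal weights are all hypothetical, and the combined score here is not the paper's TWS, whose definition is not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


@dataclass(frozen=True)
class VLMAssessment:
    """Hypothetical per-sample ratings, each assumed to lie in [0, 1]."""
    semantic_correctness: float
    visual_quality: float


def combined_score(a: VLMAssessment,
                   w_semantic: float = 0.5,
                   w_quality: float = 0.5) -> float:
    # Assumed weighted sum; the weights are placeholders, not values from the paper.
    return w_semantic * a.semantic_correctness + w_quality * a.visual_quality


def select_most_trustworthy(
    candidates: Sequence[Sample],
    assess: Callable[[Sample], VLMAssessment],
) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring candidate and all scores."""
    if not candidates:
        raise ValueError("need at least one SR candidate")
    scores = [combined_score(assess(c)) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Usage with a stub in place of a real VLM call (ratings are made up).
stub_ratings = {
    "sr_0": VLMAssessment(0.5, 1.0),
    "sr_1": VLMAssessment(1.0, 0.75),
    "sr_2": VLMAssessment(0.25, 0.5),
}
best_index, all_scores = select_most_trustworthy(
    list(stub_ratings), stub_ratings.__getitem__
)
```

In practice `assess` would wrap a prompt to a VLM and parse its answer into numeric ratings; that parsing is left out here because the structured queries are not reproduced in this summary.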
Taxonomy
Methods: Contrastive Language-Image Pre-training · Sparse Evolutionary Training · Diffusion
