VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang

TL;DR
This paper introduces VLIC, a diffusion-based image compression system that uses vision-language models (VLMs) as zero-shot perceptual judges and is post-trained on their binary preference judgments to align compression with human perception.
Contribution
The paper demonstrates that vision-language models can serve as effective zero-shot perceptual judges for image compression, leading to state-of-the-art human-aligned performance without traditional perceptual loss training.
Findings
VLMs can replicate human judgments in image comparison tasks.
VLIC achieves competitive or state-of-the-art results on perceptual metrics.
Preference-based post-training improves human-aligned image compression.
Abstract
Evaluations of image compression performance that include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned with human perception. To align compression models with human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing…
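To make the 2AFC comparison in the abstract concrete, the sketch below shows one simple way to score how often a judge's binary choices match human choices. It is an illustration only, not the paper's evaluation code: the function name twoafc_agreement and the encoding (0 or 1 for whichever of two candidate images was picked as closer to the reference) are assumptions introduced here, and the sample values are made up for the example.

```python
from typing import Sequence


def twoafc_agreement(judge_choices: Sequence[int], human_choices: Sequence[int]) -> float:
    """Fraction of 2AFC trials where the judge picks the same candidate as the human.

    Hypothetical encoding: each entry is 0 or 1, the index of the candidate image
    chosen as perceptually closer to the reference in that trial.
    """
    if len(judge_choices) != len(human_choices):
        raise ValueError("judge and human choice lists must have the same length")
    if len(judge_choices) == 0:
        raise ValueError("at least one trial is required")
    # Count the trials where both parties chose the same candidate.
    matches = sum(1 for j, h in zip(judge_choices, human_choices) if j == h)
    return matches / len(judge_choices)


# Made-up example data: five trials, the judge disagrees with the human on one.
judge = [0, 1, 1, 0, 1]
human = [0, 1, 0, 0, 1]
print(twoafc_agreement(judge, human))  # 4 of 5 trials agree
```

In practice the human label for a trial is often an aggregate over several annotators, so a fuller evaluation would need to decide how to handle ties and split votes; this sketch assumes a single binary label per trial.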
Taxonomy
Topics: Advanced Data Compression Techniques · Image and Video Quality Assessment · Multimodal Machine Learning Applications
