Understanding the Vulnerability of CLIP to Image Compression
Cangxiong Chen, Vinay P. Namboodiri, Julian Padget

TL;DR
This paper shows that CLIP, a widely used vision-language model, loses zero-shot recognition accuracy when its input images are compressed, and it uses attribution analysis to explain the effect as a basis for improving robustness.
Contribution
It demonstrates CLIP's vulnerability to image compression and uses Integrated Gradients, an attribution method, to analyze the effect both quantitatively and qualitatively, which can inform future robustness improvements.
Findings
CLIP's zero-shot recognition accuracy decreases as image quality degrades under compression.
Attribution analysis with Integrated Gradients shows how compression changes the model's zero-shot predictions.
Extensive evaluation on CIFAR-10 and STL-10 supports these findings.
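The kind of measurement behind the first finding can be sketched as a small evaluation loop. This is a minimal illustration, not the authors' code: `accuracy_by_quality`, `compress`, and `classify` are hypothetical names, and the summary does not state which codec or quality settings the paper uses. In practice `compress` might wrap a JPEG encode/decode round trip and `classify` a CLIP zero-shot classifier; both are left as caller-supplied functions here.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def accuracy_by_quality(
    samples: Sequence[Tuple[Image, int]],
    compress: Callable[[Image, int], Image],
    classify: Callable[[Image], int],
    qualities: Iterable[int],
) -> Dict[int, float]:
    """Measure classification accuracy at each compression quality level.

    samples   -- (image, true_label) pairs
    compress  -- hypothetical: returns the image after lossy compression
                 at the given quality
    classify  -- hypothetical: returns the predicted label for an image
    qualities -- quality levels to evaluate
    """
    if not samples:
        raise ValueError("samples must not be empty")

    results: Dict[int, float] = {}
    for quality in qualities:
        # Count predictions that still match the label after compression.
        correct = sum(
            1
            for image, label in samples
            if classify(compress(image, quality)) == label
        )
        results[quality] = correct / len(samples)
    return results
```

Keeping the codec and the model behind plain callables means the same loop can be reused across datasets such as CIFAR-10 and STL-10 without changes.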
Abstract
CLIP is a widely used foundational vision-language model for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to changes in image quality under compression. This surprising result is further analysed using an attribution method, Integrated Gradients. Using this attribution method, we are able to better understand, both quantitatively and qualitatively, how compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis to understand this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models.
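For readers unfamiliar with Integrated Gradients, the method attributes a model output to each input feature by averaging gradients along a straight path from a baseline input to the actual input, then scaling by the input difference. The sketch below is a generic Riemann-sum approximation of that idea, not the paper's implementation: `integrated_gradients` and `grad_fn` are hypothetical names, the midpoint rule and step count are illustrative choices, and with CLIP the gradient would come from automatic differentiation of the image-text similarity score rather than a hand-written function.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn  -- hypothetical: returns the gradient of the model output
                with respect to the input, evaluated at a given point
    x        -- the input to explain
    baseline -- reference input (for images, an all-black image is a
                common choice; that is an assumption here)
    steps    -- number of points sampled along the path
    """
    if len(x) != len(baseline):
        raise ValueError("x and baseline must have the same length")
    if steps < 1:
        raise ValueError("steps must be at least 1")

    gradient_sums = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        # Point on the straight line from baseline to x.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            gradient_sums[i] += g

    # Average gradient along the path, scaled by the input difference.
    return [
        (xi - b) * total / steps
        for xi, b, total in zip(x, baseline, gradient_sums)
    ]
```

A useful sanity check is the completeness property: the attributions should sum approximately to the difference between the model output at `x` and at `baseline`. Comparing attributions for an original image and its compressed version is one way such an analysis could expose which input regions change the prediction.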
Taxonomy
Topics: Digital Media Forensic Detection · Advanced Image Processing Techniques · Medical Imaging Techniques and Applications
Methods: Contrastive Language-Image Pre-training
