Unveiling Glitches: A Deep Dive into Image Encoding Bugs within CLIP
Ayush Ranjan, Daniel Wen, Karthik Bhat

TL;DR
This paper investigates the limitations of CLIP, a vision-language model, by identifying systemic image understanding faults through novel analysis frameworks, highlighting areas for improvement in AI image comprehension.
Contribution
The study introduces the Discrepancy Analysis Framework and Transformative Caption Analysis for CLIP to systematically uncover 14 key systemic faults in CLIP's image interpretation.
Findings
Identified 14 systemic faults in CLIP's image understanding
Revealed significant discrepancies between CLIP and human perception
Provided insights for improving AI image embedding models
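Illustration (not from the paper)
This summary does not describe how the Discrepancy Analysis Framework is computed, so the following is only a minimal sketch of one generic way to surface disagreement between an embedding model and human judgement. It assumes each image already has an embedding vector and that a human similarity rating in [0, 1] exists for each image pair. The function names, the toy vectors, the ratings, and the `gap` threshold are all hypothetical and chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_discrepant_pairs(embeddings, human_similarity, gap=0.5):
    """Flag image pairs whose embedding similarity differs from the
    human rating by more than `gap` (hypothetical threshold)."""
    flagged = []
    for i, j in combinations(sorted(embeddings), 2):
        model_sim = cosine_similarity(embeddings[i], embeddings[j])
        human_sim = human_similarity[(i, j)]
        if abs(model_sim - human_sim) > gap:
            flagged.append((i, j, model_sim, human_sim))
    return flagged


# Toy stand-ins for image embeddings (assumed, not real CLIP outputs).
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
}

# Hypothetical human ratings: people judge "a" and "b" as quite different,
# although their embeddings are close.
human_similarity = {
    ("a", "b"): 0.1,
    ("a", "c"): 0.0,
    ("b", "c"): 0.5,
}

for i, j, model_sim, human_sim in find_discrepant_pairs(embeddings, human_similarity):
    print(f"{i} vs {j}: model={model_sim:.2f}, human={human_sim:.2f}")
```

In this toy data only the pair ("a", "b") is flagged, because the embeddings are close while the assumed human rating is low. With a real model, the embeddings would come from its image encoder rather than hand-written vectors.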
Abstract
Understanding the limitations and weaknesses of state-of-the-art models in artificial intelligence is crucial for their improvement and responsible application. In this research, we focus on CLIP, a model renowned for its integration of vision and language processing. Our objective is to uncover recurring problems and blind spots in CLIP's image comprehension. By delving into both the commonalities and disparities between CLIP and human image understanding, we augment our comprehension of these models' capabilities. Through our analysis, we reveal significant discrepancies in CLIP's interpretation of images compared to human perception, shedding light on areas requiring improvement. Our methodologies, the Discrepancy Analysis Framework (DAF) and the Transformative Caption Analysis for CLIP (TCAC), enable a comprehensive evaluation of CLIP's performance. We identify 14 systemic faults,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
Methods: Contrastive Language-Image Pre-training · Focus
