CLIPPO: Image-and-Language Understanding from Pixels Only
Michael Tschannen, Basil Mustafa, Neil Houlsby

TL;DR
CLIPPO is a single pixel-based model that performs image, text, and multimodal tasks using only a contrastive loss. On image tasks it performs almost as well as CLIP-style models, with half the number of parameters and no text-specific components.
Contribution
The paper presents CLIPPO, a novel pixel-only multimodal model that unifies image and text processing without task-specific modules or tokenizers, simplifying training and inference.
Findings
Performs image retrieval and zero-shot classification nearly as well as CLIP.
Performs well on natural language understanding tasks without any word-level loss (language modelling or masked language modelling).
Achieves strong multilingual multimodal retrieval without modifications, since it requires no tokenizer.
Abstract
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP-style models, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text…
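The training setup described in the abstract, one encoder applied to both regular images and text rendered as images and trained with a contrastive loss alone, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the fixed `temperature` value, and the `encode` argument (a stand-in for the shared encoder) are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb form a positive pair;
    all other combinations in the batch act as negatives.
    The temperature is fixed here (an assumption for this sketch).
    """
    a = [l2_normalize(v) for v in image_emb]
    b = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Cosine similarity between every image and every text, scaled.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should pick its own column.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transpose.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


def pixel_only_loss(encode, images, rendered_texts, temperature=0.1):
    """Apply the SAME encoder to images and to text rendered as images.

    `encode` is a hypothetical callable mapping one pixel grid to an
    embedding vector; there is no separate text tower or tokenizer.
    """
    image_emb = [encode(img) for img in images]
    text_emb = [encode(txt) for txt in rendered_texts]
    return contrastive_loss(image_emb, text_emb, temperature)
```

In this sketch, `pixel_only_loss` differs from a two-tower setup only in that both inputs pass through the same `encode` callable. The loss is small when each image embedding is closest to its own rendered caption and grows when the pairs are mismatched.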
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Dropout · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Multi-Head Attention · Residual Connection · Label Smoothing · Contrastive Learning
