CLIPPO: Image-and-Language Understanding from Pixels Only

Michael Tschannen; Basil Mustafa; Neil Houlsby

arXiv:2212.08045 · cs.CV · April 4, 2023 · 1 citation

TL;DR

CLIPPO is a single pixel-based model that performs image, text, and multimodal tasks. It is trained with a contrastive loss alone and achieves results competitive with CLIP-style models, using half the parameters and no text-specific tower or embedding.

Contribution

The paper presents CLIPPO, a novel pixel-only multimodal model that unifies image and text processing without task-specific modules or tokenizers, simplifying training and inference.

Findings

01 Performs image retrieval and zero-shot image classification almost as well as CLIP-style models.

02 Achieves strong natural language understanding without any word-level loss.

03 Achieves strong multilingual multimodal retrieval without modifications, since no tokenizer is required.

Abstract

Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP-style models, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text…
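The core idea in the abstract (one encoder for both images and rendered text, trained with a contrastive loss) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: `render_text_toy`, `make_encoder`, and `contrastive_loss` are hypothetical names, the "renderer" is a toy stand-in that turns each character into a column of 8 binary pixels instead of drawing glyphs, the encoder is a single random linear map rather than a Vision Transformer, and the fixed temperature of 0.1 is an assumption.

```python
import math
import random

def render_text_toy(text, width=16):
    """Toy stand-in for a text renderer (assumption): each character becomes
    one column of 8 binary pixels taken from the low bits of its code point."""
    cols = [[(ord(c) >> b) & 1 for b in range(8)] for c in text[:width]]
    cols += [[0] * 8] * (width - len(cols))  # pad with blank columns
    # Flatten column by column into one pixel vector of length 8 * width.
    return [float(p) for col in cols for p in col]

def make_encoder(in_dim, out_dim, seed=0):
    """Build ONE shared encoder (here a random linear map, for illustration)
    that is applied to both image pixels and rendered-text pixels."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    w = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(pixels):
        z = [sum(wi * p for wi, p in zip(row, pixels)) for row in w]
        norm = math.sqrt(sum(v * v for v in z)) or 1.0
        return [v / norm for v in z]  # L2-normalised embedding

    return encode

def contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric softmax cross-entropy over the image-text similarity matrix;
    pair k (image k, text k) is treated as the positive."""
    n = len(img_emb)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt_emb]
              for i in img_emb]

    def xent(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            lse = m + math.log(sum(math.exp(v - m) for v in row))
            total += lse - row[k]  # negative log-probability of the positive
        return total / n

    columns = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (xent(logits) + xent(columns))

# Usage: random "images" and toy-rendered captions share the same encoder.
encode = make_encoder(in_dim=128, out_dim=8)
rng = random.Random(1)
images = [[rng.random() for _ in range(128)] for _ in range(3)]
captions = ["a dog", "a cat", "a red car"]
loss = contrastive_loss([encode(x) for x in images],
                        [encode(render_text_toy(c)) for c in captions])
```

The point of the sketch is that `encode` has no notion of modality: text reaches it only as pixels, so no tokenizer or separate text tower is involved.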

Code & Models

Repositories

google-research/big_vision · JAX · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Dropout · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Dense Connections · Multi-Head Attention · Residual Connection · Label Smoothing · Contrastive Learning