FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein

TL;DR
FuseLIP introduces a novel early fusion transformer architecture that combines text and image tokens into a single embedding space, enabling richer multimodal representations and improved performance on tasks like VQA.
Contribution
It proposes using a single transformer with an extended vocabulary for early fusion of multimodal tokens, advancing multimodal embedding techniques.
Findings
Outperforms other methods in VQA and image transformation retrieval
Achieves comparable results to baselines on unimodal tasks
Introduces new datasets for multimodal pre-training and evaluation
Abstract
Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems
