FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens

Christian Schlarmann; Francesco Croce; Nicolas Flammarion; Matthias Hein

arXiv:2506.03096·cs.CV·June 4, 2025

TL;DR

FuseLIP is an early-fusion architecture for multimodal embedding: a single transformer encodes text tokens and discrete image tokens together into one feature vector. This yields richer multimodal representations and improves performance on tasks such as VQA.

Contribution

The paper proposes a single transformer that operates on an extended vocabulary of text and image tokens, so the two modalities are fused early instead of being merged after separate unimodal encoders.
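To make the idea of an extended vocabulary concrete, here is a minimal sketch of how text token ids and discrete image token ids could be mapped into one shared id space before being fed to a single transformer. The vocabulary sizes, the function name `fuse_tokens`, and the ordering of image tokens before text tokens are all assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical vocabulary sizes, chosen only for this illustration.
TEXT_VOCAB_SIZE = 1000
IMAGE_VOCAB_SIZE = 512
FUSED_VOCAB_SIZE = TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE


def fuse_tokens(text_ids, image_ids):
    """Map text and image token ids into one id space and concatenate them.

    Text ids keep their values in [0, TEXT_VOCAB_SIZE). Image ids are shifted
    by TEXT_VOCAB_SIZE so the two ranges never collide.
    """
    for t in text_ids:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {t}")
    for i in image_ids:
        if not 0 <= i < IMAGE_VOCAB_SIZE:
            raise ValueError(f"image id out of range: {i}")
    shifted_image_ids = [TEXT_VOCAB_SIZE + i for i in image_ids]
    # Assumed ordering: image tokens first, then text tokens.
    return shifted_image_ids + list(text_ids)


fused = fuse_tokens(text_ids=[5, 7], image_ids=[0, 511])
print(fused)
```

Because every id in the fused sequence indexes the same embedding table, one transformer can attend across image and text tokens in every layer, which is what distinguishes early fusion from merging the outputs of two separate encoders.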

Findings

01. Outperforms other methods on VQA and image transformation retrieval.

02. Achieves results comparable to baselines on unimodal tasks.

03. Introduces new datasets for multimodal pre-training and evaluation.

Abstract

Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for…
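The contrastive alignment mentioned at the start of the abstract can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. This is a generic CLIP-style formulation written in plain Python, not FuseLIP's training objective, which this text does not spell out. The function names and the temperature value of 0.07 are assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_diagonal_nll(logits):
    """Average cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(a_embs, b_embs, temperature=0.07):
    """Symmetric InfoNCE loss; a_embs[i] and b_embs[i] form a matching pair."""
    a = [normalize(v) for v in a_embs]
    b = [normalize(v) for v in b_embs]
    # Cosine similarity of every a with every b, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_diagonal_nll(logits) + mean_diagonal_nll(logits_t))


pairs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(pairs, pairs)
misaligned = contrastive_loss(pairs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < misaligned)
```

The loss is small when each embedding is most similar to its own partner and large when it is more similar to another item in the batch. In a late-fusion setup the two inputs come from separate encoders; in an early-fusion setup either side could be the output of the single fused encoder.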

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems