TL;DR
GELATO is an approach for building multimodal embedding models that efficiently encode text, images, audio, and video into a single semantic space by freezing the backbone models and training only the small connecting components.
Contribution
The paper presents GELATO, a novel multimodal embedding approach that extends existing models with minimal training, achieving state-of-the-art performance across multiple modalities.
Findings
GELATO produces results competitive with those of larger models.
Training only 0.35% of total weights reduces computational cost.
Because the text backbone stays frozen, text embedding quality is identical to that of the underlying Jina Embeddings v5 Text models.
Abstract
In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modalities by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We trained only the connecting components, which represent 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the…
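The frozen-tower idea in the abstract can be illustrated with a small bookkeeping sketch: mark each component as frozen or trainable, then compute the share of weights that actually receives gradient updates. Everything in it is an assumption for illustration. The `Component` class, the component names, and the parameter counts are hypothetical toy values, chosen only so that the ratio reproduces the 0.35% figure quoted above; they are not the real sizes of the jina-embeddings-v5-omni models.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One block of the joint model (hypothetical bookkeeping structure)."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Return the share of all parameters that belongs to trainable components."""
    total = sum(c.num_params for c in components)
    trainable = sum(c.num_params for c in components if c.trainable)
    return trainable / total


# Toy parameter counts, NOT the real model sizes: picked so that the
# connectors make up 35 of 10,000 weights, i.e. 0.35%.
components = [
    Component("text_backbone", 9_000, trainable=False),  # frozen
    Component("image_encoder", 700, trainable=False),    # frozen
    Component("audio_encoder", 265, trainable=False),    # frozen
    Component("connectors", 35, trainable=True),         # the only trained part
]

fraction = trainable_fraction(components)
print(f"Trainable share of weights: {fraction:.2%}")
```

In a real training setup the same split would typically be expressed by disabling gradients on the frozen towers and passing only the connector parameters to the optimizer, which is why the optimizer state and gradient memory scale with the small trainable share rather than with the full model.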
Code & Models
- 🤗 jinaai/jina-embeddings-v5-omni-small (model) · 34k downloads · ♡ 58
- 🤗 jinaai/jina-embeddings-v5-omni-small-retrieval (model) · 35k downloads · ♡ 10
- 🤗 jinaai/jina-embeddings-v5-omni-nano (model) · 35k downloads · ♡ 22
- 🤗 jinaai/jina-embeddings-v5-omni-nano-retrieval (model) · 31k downloads · ♡ 7
- 🤗 jinaai/jina-embeddings-v5-omni-nano-clustering (model) · 11k downloads · ♡ 3
- 🤗 jinaai/jina-embeddings-v5-omni-small-retrieval-GGUF (model) · 1.7k downloads · ♡ 2
- 🤗 jinaai/jina-embeddings-v5-omni-nano-retrieval-GGUF (model) · 1.3k downloads · ♡ 1
- 🤗 jinaai/jina-embeddings-v5-omni-small-classification (model) · 12k downloads · ♡ 1
- 🤗 jinaai/jina-embeddings-v5-omni-small-clustering (model) · 12k downloads · ♡ 1
- 🤗 jinaai/jina-embeddings-v5-omni-small-text-matching (model) · 12k downloads · ♡ 1
