Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao

arXiv:2405.20204 · cs.CL · June 27, 2024 · 2 cites

TL;DR

Jina CLIP introduces a multi-task contrastive training approach that enables a single CLIP-based model to excel in both multimodal and text-only retrieval tasks, streamlining information retrieval systems.

Contribution

The paper presents a novel multi-task contrastive training method that improves CLIP's performance on text-only tasks without sacrificing multimodal retrieval capabilities.

Findings

1. Achieved state-of-the-art results on text-image retrieval
2. Improved text-only retrieval performance
3. Unified model for multimodal and text-only tasks

Abstract

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
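
The abstract does not spell out the training objective, so the following is only a minimal sketch of what a multi-task contrastive loss can look like, not the authors' implementation. It assumes a symmetric InfoNCE loss applied separately to text-text pairs and to caption-image pairs, with the two terms added together. The function names, the temperature of 0.07, the weights `w_text` and `w_image`, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair,
    # every other row in the batch serves as a negative.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def multi_task_loss(query, document, caption, image, w_text=1.0, w_image=1.0):
    # Assumed combination: weighted sum of a text-text term and a text-image term.
    # In a real model, `query`, `document`, and `caption` would come from one
    # shared text encoder and `image` from the image encoder.
    return w_text * info_nce(query, document) + w_image * info_nce(caption, image)


# Random stand-in embeddings: batch of 4 pairs, 8 dimensions each.
rng = np.random.default_rng(0)
query, document = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
caption, image = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = multi_task_loss(query, document, caption, image)
print(f"combined loss: {loss:.4f}")
```

Because both terms pass through the same text encoder in this setup, gradients from the text-text pairs and from the caption-image pairs update shared parameters, which is the intuition behind using one model for both retrieval settings.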

Taxonomy

Topics: Natural Language Processing Techniques

Methods: ALIGN · Contrastive Language-Image Pre-training