Long-CLIP: Unlocking the Long-Text Capability of CLIP

Beichen Zhang; Pan Zhang; Xiaoyi Dong; Yuhang Zang; Jiaqi Wang

arXiv:2403.15378 · cs.CV · July 23, 2024 · 2 cites

TL;DR

Long-CLIP extends CLIP's capabilities to handle long, detailed text inputs for improved image retrieval and generation, maintaining zero-shot performance without extensive retraining.

Contribution

The paper introduces Long-CLIP, a plug-and-play method that enables CLIP to process long texts effectively while preserving its original zero-shot abilities.

Findings

01. Achieves a 20% improvement in long-caption image retrieval.

02. Improves traditional text-image retrieval tasks by 6%.

03. Supports detailed text-to-image generation.
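The retrieval metric behind these figures is not stated on this page. As a minimal sketch, assuming retrieval is scored the common way (rank images by cosine similarity to each caption embedding and count how often the paired image comes first), recall@1 could be computed as below. The function names and the toy embeddings are hypothetical and are not taken from the paper or its repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_1(text_embs, image_embs):
    """Fraction of captions whose top-ranked image is the paired one.

    Assumes caption i is paired with image i (an illustrative convention).
    """
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, image) for image in image_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(text_embs)

# Toy embeddings (made up for illustration): each caption is closest to its own image.
aligned_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_images = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.8, 0.0, 0.2]]
print(recall_at_1(aligned_text, aligned_images))
```

A reported gain in retrieval would correspond to this fraction rising when the text encoder is swapped, with the image set and captions held fixed.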

Abstract

Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The length of the text token is restricted to 77, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns the CLIP latent space, making it readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving…
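To make the 77-token limit concrete, here is a minimal sketch of what a fixed context length does to a long caption. It assumes a toy whitespace tokenizer (`toy_tokenize`) and placeholder `<start>`/`<end>` markers, all hypothetical; CLIP itself uses a byte-pair-encoding tokenizer, so real token counts will differ.

```python
CONTEXT_LENGTH = 77  # token limit stated in the abstract

def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def truncate_to_context(tokens, context_length=CONTEXT_LENGTH):
    """Keep at most context_length positions, reserving two for start/end markers."""
    body = tokens[: context_length - 2]
    return ["<start>"] + body + ["<end>"]

# A 120-word caption stands in for a long, detailed description.
caption = " ".join(f"word{i}" for i in range(120))
tokens = toy_tokenize(caption)
kept = truncate_to_context(tokens)
dropped = len(tokens) - (len(kept) - 2)
print(f"kept {len(kept)} positions, dropped {dropped} caption tokens")
```

Everything past the cutoff never reaches the text encoder, which is why details late in a long description cannot influence retrieval or generation under the original limit.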

Code & Models

Repositories

beichenzbc/long-clip (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training