RWKV-CLIP: A Robust Vision-Language Representation Learner

Tiancheng Gu; Kaicheng Yang; Xiang An; Ziyong Feng; Dongnan Liu; Weidong Cai; Jiankang Deng

arXiv:2406.06973·cs.CV·September 24, 2024

Open Access · 2 repos · 10 models · 1 dataset · 1 video

TL;DR

RWKV-CLIP is a vision-language model that combines the parallel training of transformers with the efficient inference of RNNs. Trained on image-text data refined with an LLM-based description generation framework, it achieves state-of-the-art results on several vision-language tasks.

Contribution

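RWKV-CLIP is the first RWKV-driven vision-language representation learning model. It is paired with a description generation framework that uses large language models to improve the quality of web-crawled image-text data, and it performs robustly and efficiently across multiple tasks.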

Findings

01 Achieves state-of-the-art results in zero-shot classification

02 Demonstrates robustness across different model scales and pre-training datasets

03 Combines the parallel training of transformers with the efficient inference of RNNs
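
To make the third finding more concrete, the sketch below shows how one linear recurrence can be evaluated in two equivalent ways: a parallel form, where every output is computed independently from the full prefix (convenient for training), and a recurrent form, where a single running state is updated per token (convenient for inference). This is a toy scalar example with a fixed decay, not RWKV's actual formulation, which this page does not give; the function names and the decay value are assumptions made for illustration.

```python
def parallel_form(values, decay):
    """Compute every output directly from its prefix.

    Output t is sum over s <= t of decay**(t - s) * values[s]. Each output
    depends only on the inputs, so all positions could be computed at once.
    """
    return [
        sum(decay ** (t - s) * values[s] for s in range(t + 1))
        for t in range(len(values))
    ]


def recurrent_form(values, decay):
    """Compute the same outputs with one running state, one step per token."""
    state = 0.0
    outputs = []
    for v in values:
        # Constant work and memory per step: the state summarizes the prefix.
        state = decay * state + v
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    tokens = [1.0, 2.0, 3.0]
    print(parallel_form(tokens, decay=0.5))
    print(recurrent_form(tokens, decay=0.5))
```

Both functions return the same sequence. The parallel form does more total work but has no step-to-step dependency, while the recurrent form keeps only a fixed-size state between steps.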

Abstract

Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that…
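
As a rough illustration of the contrastive objective behind CLIP-style pre-training, the sketch below computes a symmetric contrastive (InfoNCE) loss over a small batch of paired image and text embeddings. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions made for illustration, and the image and text encoders that would produce the embeddings are omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: for image i, the correct text is column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: for text j, the correct image is row j.
    t2i = -sum(
        log_softmax([logits[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(clip_contrastive_loss(images, matched_texts))
    print(clip_contrastive_loss(images, swapped_texts))
```

The loss is low when each image is most similar to its own text and high when the pairs are swapped, which is the signal that pulls matching image-text pairs together during pre-training.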

Code & Models

Datasets

Kaichengalex/YFCC15M
dataset · 3.2k downloads

Videos

RWKV-CLIP: A Robust Vision-Language Representation Learner

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training