RWKV-CLIP: A Robust Vision-Language Representation Learner
Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

TL;DR
RWKV-CLIP is a vision-language model that combines transformer-style parallel training with RNN-style efficient inference. It uses LLM-based data synthesis to improve the quality of image-text pairs and reports state-of-the-art results on several vision-language tasks.
Contribution
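RWKV-CLIP is the first RWKV-driven vision-language representation learning model. The paper also introduces a diverse description generation framework that uses large language models to improve the quality of web-crawled image-text data, and it reports robust, efficient performance across multiple tasks.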
Findings
- Achieves state-of-the-art results in zero-shot classification
- Demonstrates robustness across different model scales and datasets
- Trains and runs efficiently by combining the parallel training of transformers with the recurrent inference of RNNs
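The last finding rests on the idea that one computation can be written in two equivalent ways: a parallel form that sees the whole sequence at once (convenient for training) and a recurrent form that carries a single running state (cheap at inference). The sketch below illustrates that idea with a deliberately simplified scalar recurrence with a constant decay. It is not the RWKV formulation used in the paper; the function names, the `decay` parameter, and the update rule are all illustrative assumptions.

```python
import math


def recurrent_form(keys, values, decay):
    """Process tokens one at a time, carrying one running state (RNN-style)."""
    state = 0.0
    outputs = []
    for k, v in zip(keys, values):
        # The new state is the decayed old state plus the current contribution.
        state = decay * state + k * v
        outputs.append(state)
    return outputs


def parallel_form(keys, values, decay):
    """Compute each position directly from the full sequence (training-style).

    Every output depends only on the inputs, not on a previously computed
    state, so all positions could be evaluated independently.
    """
    n = len(keys)
    return [
        sum(decay ** (t - i) * keys[i] * values[i] for i in range(t + 1))
        for t in range(n)
    ]


# Hypothetical toy inputs, chosen only to exercise both forms.
keys = [0.5, 1.0, 2.0]
values = [1.0, -1.0, 0.5]
decay = 0.9

rec = recurrent_form(keys, values, decay)
par = parallel_form(keys, values, decay)
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(rec, par))
print(same)
```

The recurrent form needs constant memory per step, while the parallel form trades extra arithmetic for independence between positions.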
Abstract
Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that…
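For readers unfamiliar with the contrastive pre-training the abstract refers to, the following is a minimal sketch of the symmetric image-text contrastive loss commonly associated with CLIP. This summary does not state the exact objective used by RWKV-CLIP, so treat the function names, the toy embeddings, and the temperature value of 0.07 as assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Similarity of every image to every text, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Hypothetical toy batch: two images and two captions.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = clip_contrastive_loss(image_embs, matched_text)
loss_swapped = clip_contrastive_loss(image_embs, swapped_text)
print(loss_matched < loss_swapped)
```

The loss is low when each image is most similar to its own caption and high when the pairing is wrong, which is why noisy web captions hurt this kind of training.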
Code & Models
- 🤗 sunatte/txt2sql (model)
- 🤗 MachoMaheen/devdock4bit (model)
- 🤗 Kaichengalex/RWKV-CLIP-B16-LAION10M (model, ♡ 1)
- 🤗 Kaichengalex/RWKV-CLIP-B32-LAION10M (model)
- 🤗 Kaichengalex/RWKV-CLIP-B32-LAION30M (model)
- 🤗 Kaichengalex/RWKV-CLIP-B32-YFCC15M (model)
- 🤗 sicer/arc-agi-legacy (model)
- 🤗 JilinHu/llemma_7b_3epoch_r32_e5_RQ1 (model, 1 dl)
- 🤗 Xin-Rui/LLAMA-Fac-NEW-A800 (model, ♡ 1)
- 🤗 Linksome/lmf (model)
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
