LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

Ying Nie; Wei He; Kai Han; Yehui Tang; Tianyu Guo; Fanyi Du; Yunhe Wang

arXiv:2312.00674 · cs.CV · December 4, 2023 · 1 citation

Open Access

TL;DR

This paper introduces LightCLIP, a lightweight vision-language model that employs multi-level interaction and refined alignment objectives to improve performance on downstream tasks without extra inference cost.

Contribution

The paper proposes a multi-level interaction paradigm, including a relaxed bipartite matching for token-level alignment and an MLM objective, to enhance lightweight CLIP models.
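To make the token-level idea concrete, here is a minimal sketch contrasting a strict one-to-one (bipartite) matching between words and image patches with one possible relaxation. This summary does not say how the paper relaxes the matching, so the relaxation used here (dropping the one-to-one constraint so that several words may share a patch) is an assumption, as are the function names and the toy similarity matrix.

```python
from itertools import permutations


def strict_matching_score(sim):
    """Mean similarity under the best one-to-one word-to-patch assignment.

    sim[w][p] is the similarity between word w and patch p.
    Brute force over assignments; only meant for tiny illustrative inputs.
    """
    n_words, n_patches = len(sim), len(sim[0])
    if n_words > n_patches:
        raise ValueError("need at least as many patches as words")
    best = max(
        sum(sim[w][p] for w, p in enumerate(perm))
        for perm in permutations(range(n_patches), n_words)
    )
    return best / n_words


def relaxed_matching_score(sim):
    """Mean similarity when each word independently picks its best patch.

    Assumed relaxation (hypothetical, not taken from the paper):
    the one-to-one constraint is dropped, so patches may be reused.
    """
    return sum(max(row) for row in sim) / len(sim)


# Toy example: both words are most similar to patch 0.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
]
strict = strict_matching_score(toy_sim)
relaxed = relaxed_matching_score(toy_sim)
```

In the toy matrix the strict matching forces the second word onto a weaker patch, while the relaxed version lets both words align with the same patch, so the relaxed score can never be lower than the strict one.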

Findings

01 Achieves higher downstream task performance without extra inference cost.

02 Improves fine-grained image-text alignment with relaxed bipartite matching.

03 Leverages MLM with an auxiliary fusion module to maximize text encoder potential.

Abstract

Vision-language pre-training methods such as CLIP have shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders such as ResNet50 and ViT, while lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. First, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Second, a token-level alignment objective based on relaxed bipartite matching is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of CLIP model does not…
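The first objective in the abstract, progressively softening the labels of negative samples, can be illustrated with a small sketch. The abstract does not give the schedule or how the softened mass is distributed, so the linear ramp, the uniform split across negatives, and the names `softened_targets`, `soft_contrastive_loss`, and `max_soft` are all assumptions made for illustration. Only the image-to-text direction is shown.

```python
import math


def softened_targets(batch_size, step, total_steps, max_soft=0.2):
    """Build a batch_size x batch_size soft target matrix.

    Assumed schedule (hypothetical): the mass moved from the positive
    pair to the negatives grows linearly from 0 to max_soft over
    training and is split uniformly across the negatives.
    """
    if batch_size < 2:
        raise ValueError("need at least one negative per sample")
    progress = min(max(step / total_steps, 0.0), 1.0)
    alpha = max_soft * progress
    off_diag = alpha / (batch_size - 1)
    return [
        [1.0 - alpha if i == j else off_diag for j in range(batch_size)]
        for i in range(batch_size)
    ]


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def soft_contrastive_loss(logits, targets):
    """Mean cross-entropy between softmax(logits) rows and soft targets."""
    total = 0.0
    for logit_row, target_row in zip(logits, targets):
        log_probs = log_softmax(logit_row)
        total -= sum(t * lp for t, lp in zip(target_row, log_probs))
    return total / len(logits)


# Image-to-text similarity logits for a toy batch of three pairs.
toy_logits = [
    [5.0, 1.0, 0.0],
    [0.5, 4.0, 3.5],
    [0.0, 1.0, 6.0],
]
hard_loss = soft_contrastive_loss(toy_logits, softened_targets(3, 0, 100))
soft_loss = soft_contrastive_loss(toy_logits, softened_targets(3, 100, 100))
```

At step 0 the targets are the usual one-hot labels, so the loss reduces to the standard contrastive cross-entropy; later in training, a negative that happens to match the image reasonably well is penalized less harshly than under hard labels.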


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research

Methods: Contrastive Language-Image Pre-training