CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Chu Myaet Thwal, Ye Lin Tun, Minh N. H. Nguyen, Eui-Nam Huh, Choong Seon Hong

arXiv:2412.03871 · cs.CV · March 24, 2025 · 2 cites

Open Access

TL;DR

CLIP-PING introduces a simple training method that enhances lightweight vision-language models by leveraging intrinsic neighbor guidance, significantly improving zero-shot and retrieval performance with minimal extra computation.

Contribution

The paper proposes CLIP-PING, a novel training paradigm that uses neighbor-based contrastive supervision to boost lightweight models' cross-modal alignment and semantic diversity.
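
As a rough illustration of what "neighbor" can mean here, the sketch below looks up each sample's nearest neighbor by cosine similarity in a bank of precomputed features. It is a minimal sketch under stated assumptions, not the paper's procedure: the function names, the toy `feature_bank`, and the choice of a single top-1 neighbor are all hypothetical, and the page does not say how CLIP-PING selects or uses its neighbors.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def nearest_neighbor(query_index, feature_bank):
    """Index of the sample most similar to feature_bank[query_index], excluding itself."""
    query = feature_bank[query_index]
    best_index, best_score = None, -math.inf
    for i, candidate in enumerate(feature_bank):
        if i == query_index:
            continue  # a sample is not its own neighbor
        score = cosine(query, candidate)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Hypothetical toy features standing in for frozen pre-trained encoder outputs.
feature_bank = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.2],
]
neighbors = [nearest_neighbor(i, feature_bank) for i in range(len(feature_bank))]
print(neighbors)
```

In this toy bank the first two vectors point in nearly the same direction, as do the last two, so each sample's neighbor is its partner in the pair.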

Findings

01. 5.5% improvement in zero-shot ImageNet1K classification

02. 10.7% and 5.7% gains in Flickr30K retrieval tasks

03. Strong transferability across downstream tasks

Abstract

Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, spotlighting the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a novel yet simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples, i.e.,…
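
For context, the "single image-text contrastive learning objective" that the abstract treats as the baseline is commonly written as a symmetric cross-entropy over image-text similarities. The sketch below shows that standard CLIP-style objective only, assuming a batch of paired features; it does not implement CLIP-PING's neighbor guidance, which the truncated abstract does not fully specify. The function names and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(v) for v in text_feats]
    # logits[i][j] is the scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )


# Hypothetical toy batch: pair i is (image i, text i).
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is near zero when each image is most similar to its own caption and grows when the pairing is scrambled, which is the alignment signal this objective provides.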

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: Knowledge Distillation · Contrastive Learning · Contrastive Language-Image Pre-training