SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural Network

Changze Lv; Tianlong Li; Wenhao Liu; Yufei Gu; Jianhan Xu; Cenyuan Zhang; Muling Wu; Xiaoqing Zheng; Xuanjing Huang

arXiv:2310.06488·cs.NE·May 22, 2025·1 cites

TL;DR

SpikeCLIP introduces a novel spiking neural network framework that effectively integrates visual and linguistic features, achieving competitive performance with lower energy consumption in multimodal tasks.

Contribution

The paper presents SpikeCLIP, a new approach combining alignment pre-training and dual-loss fine-tuning to unify multimodal features in SNNs, advancing energy-efficient multimodal learning.

Findings

1. SNNs achieve results comparable to ANNs in multimodal tasks.
2. SpikeCLIP significantly reduces energy consumption.
3. SpikeCLIP maintains robust classification even for out-of-category classes.
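
Classification over arbitrary label sets in CLIP-style models works by comparing an image feature with the text features of candidate label names and picking the closest one. The following is a minimal sketch of that comparison, not code from the SpikeCLIP repository: the function name `zero_shot_classify` and the three-dimensional feature vectors are hypothetical, and the encoders that would produce such features are omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(image_feat, label_feats):
    """Return the label whose text feature is most similar to the image feature.

    image_feat: list of floats (stand-in for an image-encoder output).
    label_feats: dict mapping label name -> list of floats
                 (stand-in for text-encoder outputs of label prompts).
    """
    img = normalize(image_feat)
    scores = {
        label: sum(a * b for a, b in zip(img, normalize(feat)))
        for label, feat in label_feats.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy features, made up for illustration only.
label_feats = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "truck": [0.0, 0.1, 0.9],
}
best, scores = zero_shot_classify([0.8, 0.3, 0.1], label_feats)
print(best)  # prints "cat"
```

Because the label set is just a dictionary of text features, new or unseen class names can be added at inference time without retraining a classification head.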

Abstract

Spiking Neural Networks (SNNs) have emerged as a promising alternative to conventional Artificial Neural Networks (ANNs), demonstrating comparable performance in both visual and linguistic tasks while offering the advantage of improved energy efficiency. Despite these advancements, the integration of linguistic and visual features into a unified representation through spike trains poses a significant challenge, and the application of SNNs to multimodal scenarios remains largely unexplored. This paper presents SpikeCLIP, a novel framework designed to bridge the modality gap in spike-based computation. Our approach employs a two-step recipe: an "alignment pre-training" to align features across modalities, followed by a "dual-loss fine-tuning" to refine the model's performance. Extensive experiments reveal that SNNs achieve results on par with ANNs while substantially reducing energy…
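
The two-step recipe can be illustrated with a small sketch. The abstract does not give the loss formulas, so the forms below are assumptions: step 1 is modeled as a cosine-distance loss that pulls a student (SNN) feature toward a frozen teacher (CLIP) feature, and step 2 as a cross-entropy term plus a weighted KL term that keeps the student's predictions close to the teacher's. The names `alignment_loss`, `dual_loss`, and `kl_weight` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(student_feat, teacher_feat):
    """Step 1 (assumed form): cosine distance between student and teacher features."""
    return 1.0 - cosine(student_feat, teacher_feat)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dual_loss(student_logits, teacher_logits, label, kl_weight=0.5):
    """Step 2 (assumed form): task cross-entropy plus KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    cross_entropy = -math.log(p_student[label])
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return cross_entropy + kl_weight * kl

# Features pointing in the same direction give zero alignment loss.
print(alignment_loss([1.0, 0.0], [2.0, 0.0]))
# When student and teacher logits agree, only the cross-entropy term remains.
print(dual_loss([2.0, 0.0], [2.0, 0.0], label=0))
```

In this sketch the KL term acts as a regularizer: a student that fits the task labels but drifts from the teacher's output distribution is penalized in proportion to `kl_weight`.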

Peer Reviews

Decision·ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

- SpikeCLIP is the first multimodal SNN architecture shown to handle both text and image input.
- The paper proposes an effective framework for training SpikeCLIP. The alignment pre-training stage finds a good way to transfer the representation learned by CLIP to SpikeCLIP.
- The implementation of the framework and the experiments are solid. Moreover, the experimental results show robust performance on image classification tasks (including the zero-shot setting).

Weaknesses

- Although this is, as far as I know, the first paper to realize a spiking version of CLIP, the paper itself lacks sufficient novelty. Now that surrogate-gradient-based SNN training methods are well developed, transferring or reproducing a specific conventional ANN (artificial neural network) architecture as an SNN is no longer a significant challenge. What matters more is identifying what is specific to spikes in those architectures or settings, rather than claiming “we are the firs

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

1. This paper presents an SNN image-text multimodal model.
2. The two-stage fine-tuning approach retains the performance of the original CLIP model on various tasks, including classification with undefined class labels.

Weaknesses

1. The model architecture proposed for each individual modality is not innovative enough. The image encoder uses an existing SNN architecture (Spikingformer), while the text encoder is a simpler MLP structure, bypassing the difficulties of processing long sequences with SNNs. This design choice improves training efficiency but may limit the model's text-processing capabilities.
2. The two-stage training process of distillation followed by task-specific fine-tuning lacks specific optimization for SNN

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

1. This is the first work to transfer the CLIP to the SNN field. 2. The authors provide the code, which is good.

Weaknesses

1. The novelty is limited. The two steps can be seen as a knowledge distillation (KD) method, so the work simply uses KD to convert CLIP into SpikeCLIP.
2. The results are not good. For image classification, the accuracy is worse than that of other SOTA methods. For the zero-shot task, since SpikeCLIP is not trained on a truly large dataset, it performs much worse than CLIP. The value of SpikeCLIP is therefore limited, considering that the greatest value of CLIP lies in its suitability for zero-shot tasks.

Reviewer 04 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

1. The authors propose a cross-modal SNN model and explore some downstream tasks based on it.

Weaknesses

1. The authors mainly adopt the idea of knowledge distillation to train SpikeCLIP; however, a similar scheme [1] has been proposed previously. In addition, regarding the algorithm and neuron model design of the SNN, I think the contribution of this paper is very limited. The paper is more about directly transferring CLIP-related concepts to the SNN field and lacks technical contributions specific to SNNs.
2. The performance of SpikeCLIP on downstream datasets (CIFAR-10, CIFAR-100)

Code & Models

Repositories

lvchangze/spikeclip
PyTorch · Official

Taxonomy

Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Neural Networks and Reservoir Computing

Methods: ALIGN