SPKLIP: Aligning Spike Video Streams with Natural Language
Yongchang Gao, Meiling Jin, Zhaofei Yu, Tiejun Huang, Guozhang Chen

TL;DR
SPKLIP is a novel architecture that effectively aligns sparse spike video streams with natural language, leveraging hierarchical feature extraction and spike-text contrastive learning for improved performance and energy efficiency.
Contribution
Introduces the first Spike Video-Language Alignment (Spike-VLA) model, which combines a hierarchical spike feature extractor with spike-text contrastive learning to enable few-shot learning; a full-spiking visual encoder variant improves energy efficiency.
Findings
Achieves state-of-the-art results on benchmark spike datasets.
Demonstrates strong few-shot generalization on a newly contributed real-world spike dataset.
Shows potential for energy-efficient neuromorphic deployment.
Abstract
Spike cameras offer unique sensing capabilities but their sparse, asynchronous output challenges semantic understanding, especially for Spike Video-Language Alignment (Spike-VLA) where models like CLIP underperform due to modality mismatch. We introduce SPKLIP, the first architecture specifically for Spike-VLA. SPKLIP employs a hierarchical spike feature extractor that adaptively models multi-scale temporal dynamics in event streams, and uses spike-text contrastive learning to directly align spike video with language, enabling effective few-shot learning. A full-spiking visual encoder variant, integrating SNN components into our pipeline, demonstrates enhanced energy efficiency. Experiments show state-of-the-art performance on benchmark spike datasets and strong few-shot generalization on a newly contributed real-world dataset. SPKLIP's energy efficiency highlights its potential for…
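The spike-text contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. One reviewer below describes the paper's loss (Eq. 6) as the standard CLIP loss, so the sketch assumes a symmetric InfoNCE form over a batch of paired spike-video and text embeddings. The function names and the temperature of 0.07 are illustrative assumptions, not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(spike_emb, text_emb, temperature=0.07):
    """CLIP-style loss (assumed form): row i of each input is a matched pair.

    temperature=0.07 is an assumed value, not one reported in the paper.
    """
    s = [l2_normalize(v) for v in spike_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(s)
    # logits[i][j] = cosine similarity of spike clip i and caption j, scaled.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Spike-to-text direction: each clip should pick its own caption.
    loss_s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-spike direction: each caption should pick its own clip.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2s = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)
```

With correctly paired embeddings the loss approaches zero, and it grows when the pairing is shuffled; this is the property that pulls matched spike clips and captions together during training.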
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Novel Problem Formulation: First work addressing video-language alignment specifically for spike cameras, filling an important gap between neuromorphic vision and semantic understanding. 2. Well-Motivated Architecture: The HSFE module with multi-scale temporal filtering (MTF) and spatial attention (SA) is thoughtfully designed to handle spike data's unique characteristics—sparse, asynchronous, high-frequency event streams. 3. Multimodal Alignment Validation: Text-to-video retrieval experiment
1. HMDB51-S and UCF101-S are synthetically generated from RGB videos using the SpikeCV toolkit, not real spike camera data. 2. The full-spiking variant (FSVE) drops from 86.43% to 71.11% (CNN→SNN) and 65.24% (full SNN) on UCF101-S. This 21.19-percentage-point drop (86.43% to 65.24%) undermines claims about neuromorphic deployment viability. 3. The real dataset contains only 384 samples (96×4) across 4 simple actions. 4. The contrastive loss (Eq. 6) is the standard CLIP loss, so novelty is limited. 5. No comparison with simple baselines like averaging spike
1) The motivation of the work is strong. Spiking cameras offer advantages not present in conventional cameras; however, there has been limited work on efficient processing of spiking video streams. 2) The HSFE module proposed in the paper seems interesting.
1) The computations inside the HSFE module do not seem entirely spiking. 2) Most of the other parts of the proposed model (the text encoder, even parts of the visual encoder) can be derived from available literature. The proposed contrastive learning loss is used in most video-language models, such as UniVTG. This makes the work seem more of an engineering endeavor. 3) The performance comparison with baselines might not be fair, since they are evaluated on a spiking variant of the datasets.
1) Its originality is highlighted by SPKLIP, the first end-to-end framework specifically designed for Spike Video-Language Alignment, as well as by the introduction of an energy-efficient Full-Spiking Visual Encoder. 2) A new real-world spike video dataset was also constructed. 3) The experimental results show that SPKLIP yields substantial Top-1 accuracy improvements over baselines, robust few-shot generalization, and effective text-to-video retrieval.
1) While the paper highlights the Full-Spiking Visual Encoder as a significant contribution, its connection to the main SPKLIP framework and its direct effectiveness are not fully explored through experiments. It remains ambiguous how FSVE directly contributes to or could enhance or degrade the main SPKLIP framework's performance. 2) For the Hierarchical Spike Feature Extractor, the decision to divide the input spike stream into "five temporally overlapping sub-blocks" raises questions. The ra
1. It innovatively fills a gap in the field by proposing the first end-to-end architecture dedicated to Spike-VLA, effectively resolving the modality mismatch issue of traditional vision-language models on spike data. 2. The design of core components is highly targeted: HSFE adapts to the sparse and asynchronous characteristics of spike data, while STCL enables direct alignment between spike videos and text, and their synergy enhances cross-modal semantic understanding. 3. The experimental valid
1. Typos: Line 149: spikestatus(”0”or”1”). The formulation is confusing. Line 150, the letter of "H x W" is different from "H x W" in line 158. I hope the authors can fix these typos. 2. The theoretical in-depth exploration of the photon conservation mechanism in HSFE is insufficient, and no comparative analysis with existing dynamic channel allocation methods (e.g., attention-driven channel selection) is conducted. Could you explain this? 3. Although the FSVE improves energy efficiency, it suff
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices
Methods: Contrastive Learning · Spiking Neural Networks · Contrastive Language-Image Pre-training · ALIGN
