TL;DR
This paper introduces a brain-inspired spiking neural network for image-text retrieval that achieves high accuracy with low energy consumption by integrating unimodal features at the spike level.
Contribution
It proposes the Cross-Modal Spike Fusion network (CMSF), described as the first application of a spike fusion mechanism in multimodal SNNs, which enhances multimodal representations and efficiency in image-text retrieval.
Findings
CMSF surpasses state-of-the-art ANN methods in retrieval accuracy.
CMSF achieves high speed and low energy consumption with only two time steps.
The proposed framework offers new insights for future multimodal SNN research.
Abstract
Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite…
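To make the idea of spike-level fusion with soft supervision concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the leaky integrate-and-fire update, the element-wise OR fusion rule, the firing-rate readout, and the mean-squared-error loss are all assumptions chosen for illustration, and every function name (`lif_spikes`, `fuse_spikes`, `firing_rate`, `soft_supervision_loss`) is hypothetical. Only the use of two time steps and the general flow (unimodal spikes, fused spikes, fused representation guiding the unimodal ones) follow the text above.

```python
def lif_spikes(currents, time_steps=2, threshold=1.0, decay=0.5):
    """Simulate leaky integrate-and-fire neurons driven by constant currents.

    Returns a spike train as a list of time steps, each a list of 0/1 values.
    The LIF dynamics here are an assumed, simplified neuron model.
    """
    membrane = [0.0] * len(currents)
    train = []
    for _ in range(time_steps):
        step = []
        for i, current in enumerate(currents):
            # Leak the previous potential, then integrate the input.
            membrane[i] = decay * membrane[i] + current
            fired = 1 if membrane[i] >= threshold else 0
            if fired:
                membrane[i] = 0.0  # hard reset after a spike
            step.append(fired)
        train.append(step)
    return train


def fuse_spikes(image_train, text_train):
    """Assumed fusion rule: a fused neuron fires if either modality fires."""
    return [
        [max(a, b) for a, b in zip(image_step, text_step)]
        for image_step, text_step in zip(image_train, text_train)
    ]


def firing_rate(train):
    """Average each neuron's spikes over time to get a rate embedding."""
    steps = len(train)
    return [sum(neuron) / steps for neuron in zip(*train)]


def soft_supervision_loss(unimodal_rate, fused_rate):
    """Assumed soft target: MSE pulling a unimodal embedding toward the fused one."""
    squared = [(u - f) ** 2 for u, f in zip(unimodal_rate, fused_rate)]
    return sum(squared) / len(squared)


# Toy input currents standing in for image and text encoder outputs.
image_currents = [1.2, 0.7, 0.1, 0.4]
text_currents = [0.2, 1.1, 0.1, 0.7]

image_train = lif_spikes(image_currents)
text_train = lif_spikes(text_currents)
fused_train = fuse_spikes(image_train, text_train)

fused_rate = firing_rate(fused_train)
image_loss = soft_supervision_loss(firing_rate(image_train), fused_rate)
text_loss = soft_supervision_loss(firing_rate(text_train), fused_rate)
```

In a trained network the two losses would be minimized alongside a retrieval objective so that each unimodal spike embedding moves toward the fused representation; the toy values here only show the data flow.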
