Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval

Xintao Zong; Xian Zhong; Wenxuan Liu; Jianhao Ding; Zhaofei Yu; Tiejun Huang

arXiv:2603.26787·cs.CV·March 31, 2026

TL;DR

This paper introduces a brain-inspired spiking neural network for image-text retrieval that achieves high accuracy with low energy consumption by integrating unimodal features at the spike level.

Contribution

The paper proposes a spike fusion mechanism for multimodal SNNs and applies it to image-text retrieval for the first time, aiming to improve both representation quality and efficiency.

Findings

01. CMSF surpasses state-of-the-art ANN methods in retrieval accuracy.

02. CMSF achieves high speed and low energy consumption with only two time steps.

03. The proposed framework offers new insights for future multimodal SNN research.

Abstract

Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite…
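
To make the idea of integrating unimodal features "at the spike level" more concrete, the following is a minimal NumPy sketch, not the CMSF implementation. Everything in it is an assumption made for illustration: the leaky integrate-and-fire (LIF) neuron with a hard reset, the threshold and decay values, the function names, and in particular the fusion rule (summing the two spike trains and passing the result through another LIF layer). It does not model the soft supervisory signal described in the abstract. The only detail taken from this page is the use of two time steps.

```python
import numpy as np

def lif_spikes(inputs, threshold=1.0, decay=0.5):
    """Run a leaky integrate-and-fire layer over T time steps.

    inputs: array of shape (T, D) with the input current at each step.
    Returns a binary spike array of the same shape.
    threshold and decay are illustrative values, not taken from the paper.
    """
    membrane = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs, dtype=float)
    for t in range(inputs.shape[0]):
        membrane = decay * membrane + inputs[t]    # leaky integration
        fired = membrane >= threshold              # threshold crossing
        spikes[t] = fired.astype(float)
        membrane = np.where(fired, 0.0, membrane)  # hard reset after a spike
    return spikes

def fuse_spikes(image_spikes, text_spikes):
    """Hypothetical fusion rule: sum both spike trains, then re-spike."""
    return lif_spikes(image_spikes + text_spikes)

def firing_rate(spikes):
    """Average spikes over time to obtain a real-valued embedding."""
    return spikes.mean(axis=0)

def cosine(a, b, eps=1e-8):
    """Cosine similarity; eps guards against all-zero embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Two time steps, as reported for CMSF; the feature size is arbitrary.
T, D = 2, 8
rng = np.random.default_rng(0)
image_current = rng.uniform(0.0, 2.0, size=(T, D))  # stand-in image features
text_current = rng.uniform(0.0, 2.0, size=(T, D))   # stand-in text features

image_spikes = lif_spikes(image_current)
text_spikes = lif_spikes(text_current)
fused_spikes = fuse_spikes(image_spikes, text_spikes)

# Retrieval score between the two unimodal spike embeddings.
similarity = cosine(firing_rate(image_spikes), firing_rate(text_spikes))
```

With binary inputs and a threshold of 1.0, this particular fusion rule reduces to an element-wise OR of the two spike trains. The point of the sketch is only that fusion operates on binary events rather than dense activations; the actual mechanism is in the paper and the linked repository.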

Code & Models

Repositories

zxt6174/CMSF (GitHub)