Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning
Xiang He, Dongcheng Zhao, Yiting Dong, Guobin Shen, Xin Yang, Yi Zeng

TL;DR
This paper introduces a novel Transformer-based multimodal Spiking Neural Network framework called S-CMRL that effectively integrates audio-visual information through semantic alignment and residual learning, achieving state-of-the-art results.
Contribution
The paper proposes a new SNN architecture with semantic-alignment cross-modal residual learning for improved audio-visual integration in multimodal scenarios.
Findings
S-CMRL outperforms existing multimodal SNN methods on benchmark datasets, achieving state-of-the-art performance.
Demonstrates effective cross-modal feature alignment and fusion.
Abstract
Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain's information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, thereby limiting their effectiveness in real-world multimodal scenarios. To address this challenge, we propose a semantic-alignment cross-modal residual learning (S-CMRL) framework, a Transformer-based multimodal SNN architecture designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities, and incorporates a cross-modal residual learning strategy to enhance feature integration.…
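The two ideas named in the abstract, spiking attention across modalities and a cross-modal residual connection, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`lif_spikes`, `cross_modal_residual`), the leaky integrate-and-fire parameters, the softmax-free attention form, the `scale` and `alpha` factors, and all tensor shapes are assumptions made for illustration. The attention here also acts within each time step only, so it is simpler than the spatiotemporal mechanism the abstract describes.

```python
import numpy as np

def lif_spikes(current, threshold=1.0, decay=0.5):
    """Leaky integrate-and-fire neuron over time; current has shape (T, N, D).

    Hypothetical parameters: threshold and decay are illustrative values.
    """
    membrane = np.zeros_like(current[0])
    spikes = np.zeros_like(current)
    for t in range(current.shape[0]):
        membrane = decay * membrane + current[t]   # leak, then integrate input
        fired = membrane >= threshold
        spikes[t] = fired.astype(current.dtype)    # emit binary spikes
        membrane = np.where(fired, 0.0, membrane)  # hard reset after a spike
    return spikes

def cross_modal_residual(audio, visual, w_q, w_k, w_v, scale=0.125, alpha=1.0):
    """Audio tokens query visual tokens; the result is added back as a residual.

    audio: (T, Na, D), visual: (T, Nv, D), weights: (D, D).
    Assumption: softmax-free attention on binary spike tensors.
    """
    q = lif_spikes(audio @ w_q)    # spike-form queries from audio
    k = lif_spikes(visual @ w_k)   # spike-form keys from visual
    v = lif_spikes(visual @ w_v)   # spike-form values from visual
    attn = (q @ k.transpose(0, 2, 1)) * scale  # (T, Na, Nv) spike-overlap scores
    complement = attn @ v                       # (T, Na, D) visual info per audio token
    return audio + alpha * complement           # residual keeps the audio features intact

rng = np.random.default_rng(0)
T, Na, Nv, D = 4, 3, 5, 8                       # time steps, token counts, feature dim
audio = rng.random((T, Na, D))
visual = rng.random((T, Nv, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) for _ in range(3))
fused = cross_modal_residual(audio, visual, w_q, w_k, w_v)
print(fused.shape)  # (4, 3, 8)
```

The residual form means that if the cross-modal term contributes nothing (for example `alpha=0.0`), the output reduces to the unimodal audio features, so the other modality can only add to them rather than replace them.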
Taxonomy
Topics: Hearing Loss and Rehabilitation · Multisensory perception and integration · Music and Audio Processing
Methods: Softmax · Attention Is All You Need · Spiking Neural Networks · ALIGN · Focus
