Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning

Xiang He; Dongcheng Zhao; Yiting Dong; Guobin Shen; Xin Yang; Yi Zeng

arXiv:2502.12488 · cs.CV · February 19, 2025

TL;DR

This paper introduces S-CMRL, a Transformer-based multimodal Spiking Neural Network framework that integrates audio-visual information through semantic alignment and cross-modal residual learning, and reports state-of-the-art results.

Contribution

The paper proposes an SNN architecture that combines semantic alignment with cross-modal residual learning to improve audio-visual integration.

Findings

01. S-CMRL outperforms existing methods on benchmark datasets.

02. Achieves state-of-the-art performance in multimodal SNN tasks.

03. Demonstrates effective cross-modal feature alignment and fusion.

Abstract

Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain's information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, thereby limiting their effectiveness in real-world multimodal scenarios. To address this challenge, we propose a semantic-alignment cross-modal residual learning (S-CMRL) framework, a Transformer-based multimodal SNN architecture designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities, and incorporates a cross-modal residual learning strategy to enhance feature integration.…
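
The cross-modal residual idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`lif_spikes`, `cross_modal_residual`), the leaky integrate-and-fire parameters, the softmax-free attention, and the fixed scaling factor are all assumptions chosen for illustration. It also attends only across tokens within each time step, so it does not capture the spatiotemporal attention the paper describes. The point is only to show complementary audio features being added to the visual stream as a residual, so that the visual representation is preserved when the audio contributes nothing.

```python
import numpy as np

def lif_spikes(current, threshold=1.0, decay=0.5):
    """Hypothetical leaky integrate-and-fire layer; time is axis 0."""
    membrane = np.zeros_like(current[0])
    spikes = np.zeros_like(current)
    for t in range(current.shape[0]):
        # Leak the previous potential, then integrate the new input.
        membrane = decay * membrane + current[t]
        fired = membrane >= threshold
        spikes[t] = fired.astype(current.dtype)
        # Hard reset for neurons that fired.
        membrane = np.where(fired, 0.0, membrane)
    return spikes

def cross_modal_residual(visual, audio, w_q, w_k, w_v, scale=0.125):
    """Add audio-derived features to visual features as a residual.

    visual: (T, N, D) visual tokens over T time steps
    audio:  (T, M, D) audio tokens over T time steps
    w_q, w_k, w_v: (D, D) projection matrices (assumed shapes)
    """
    q = lif_spikes(visual @ w_q)   # queries from the visual stream
    k = lif_spikes(audio @ w_k)    # keys from the audio stream
    v = lif_spikes(audio @ w_v)    # values from the audio stream
    # Softmax-free attention over binary spikes (an assumption here).
    attn = (q @ k.transpose(0, 2, 1)) * scale   # (T, N, M)
    complement = attn @ v                       # (T, N, D)
    # Residual connection: the visual features pass through unchanged.
    return visual + complement

rng = np.random.default_rng(0)
T, N, M, D = 4, 6, 5, 8
visual = rng.normal(size=(T, N, D))
audio = rng.normal(size=(T, M, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
fused = cross_modal_residual(visual, audio, w_q, w_k, w_v)
```

With a silent audio input (all zeros), no key or value neuron fires, so `cross_modal_residual` returns the visual features unchanged; that pass-through behavior is the property a residual formulation is meant to provide.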

Code & Models

Repositories

brain-cog-lab/s-cmrl (PyTorch, official)

Taxonomy

Topics: Hearing Loss and Rehabilitation · Multisensory perception and integration · Music and Audio Processing

Methods: Softmax · Attention Is All You Need · Spiking Neural Networks · ALIGN · Focus