Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning
Wenrui Li, Penghong Wang, Ruiqin Xiong, Xiaopeng Fan

TL;DR
This paper introduces a novel Spiking Tucker Fusion Transformer that combines spiking neural networks (SNNs) and transformers to improve audio-visual zero-shot learning by effectively integrating temporal and semantic information, achieving state-of-the-art results.
Contribution
The paper proposes a new Spiking Tucker Fusion Transformer with dynamic inference, global-local pooling, and a Tucker fusion module for multi-scale SNN and transformer integration in ZSL.
Findings
Achieves state-of-the-art performance on three benchmark datasets.
Improves the harmonic mean by 15.4%, 3.9%, and 14.9% on the three datasets, respectively.
Demonstrates effective fusion of temporal and semantic features.
Abstract
Spiking neural networks (SNNs), which efficiently encode temporal sequences, have shown great potential in extracting audio-visual joint feature representations. However, coupling SNNs (binary spike sequences) with transformers (floating-point sequences) to jointly explore temporal-semantic information still faces challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL). The STFT leverages the temporal and semantic information from different time steps to generate robust representations. The time-step factor (TSF) is introduced to dynamically synthesize the subsequent inference information. To guide the formation of input membrane potentials and reduce spike noise, we propose global-local pooling (GLP), which combines the max and average pooling operations. Furthermore, the thresholds of the spiking neurons…
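The abstract describes global-local pooling only as a combination of max and average pooling, so the following is a minimal sketch of that general idea and not the paper's formulation. The function name `blend_max_avg_pool`, the non-overlapping `window`, and the mixing weight `alpha` are all assumptions introduced for illustration.

```python
import numpy as np


def blend_max_avg_pool(x, window=2, alpha=0.5):
    """Blend max and average pooling along the time axis.

    x      : array of shape (T, D), T time steps of D-dimensional features.
    window : hypothetical size of each non-overlapping pooling window.
    alpha  : hypothetical mixing weight (1.0 = pure max, 0.0 = pure average).
    """
    t, d = x.shape
    usable = (t // window) * window          # drop any trailing partial window
    blocks = x[:usable].reshape(usable // window, window, d)
    max_pooled = blocks.max(axis=1)          # keeps the strongest response per window
    avg_pooled = blocks.mean(axis=1)         # smooths out isolated spikes per window
    return alpha * max_pooled + (1.0 - alpha) * avg_pooled


features = np.array([[1.0, 4.0],
                     [3.0, 2.0],
                     [5.0, 0.0],
                     [7.0, 8.0]])
pooled = blend_max_avg_pool(features)
print(pooled)
```

Max pooling alone preserves salient activations but also passes through outliers, while average pooling alone suppresses noise at the cost of flattening peaks; a weighted blend is one simple way to trade the two off.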
Taxonomy
Topics: Neural Networks and Reservoir Computing · Advanced Adaptive Filtering Techniques · Blind Source Separation Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Byte Pair Encoding · Layer Normalization · Average Pooling · Spiking Neural Networks · Label Smoothing
