Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning

Wenrui Li; Penghong Wang; Ruiqin Xiong; Xiaopeng Fan

arXiv:2407.08130 · cs.MM · July 12, 2024 · 2 cites

Open Access

TL;DR

This paper introduces a novel Spiking Tucker Fusion Transformer that combines SNNs and transformers to improve audio-visual zero-shot learning by effectively integrating temporal and semantic information, achieving state-of-the-art results.

Contribution

The paper proposes a new Spiking Tucker Fusion Transformer with dynamic inference, global-local pooling, and a Tucker fusion module for multi-scale SNN and transformer integration in ZSL.

Findings

01 Achieves state-of-the-art performance on three benchmark datasets.

02 Significant harmonic mean improvements of 15.4%, 3.9%, and 14.9%.

03 Demonstrates effective fusion of temporal and semantic features.
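
The harmonic mean reported in the findings can be illustrated with a minimal sketch. It assumes the usual generalized zero-shot learning definition, in which the harmonic mean balances accuracy on seen and unseen classes; the page itself does not spell the formula out. The function name and the input accuracies are hypothetical and are not results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    Assumes the common GZSL definition HM = 2 * S * U / (S + U).
    Both inputs must use the same scale (fractions or percentages).
    """
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero, so the harmonic mean is zero as well.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


# Hypothetical accuracies, for illustration only (not taken from the paper).
hm = harmonic_mean(0.60, 0.30)
print(f"HM = {hm:.3f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot score well by excelling on seen classes alone.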

Abstract

Spiking neural networks (SNNs), which efficiently encode temporal sequences, have shown great potential in extracting audio-visual joint feature representations. However, coupling SNNs (binary spike sequences) with transformers (floating-point sequences) to jointly explore temporal-semantic information still faces challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL). The STFT leverages the temporal and semantic information from different time steps to generate robust representations. The time-step factor (TSF) is introduced to dynamically synthesize the subsequent inference information. To guide the formation of input membrane potentials and reduce spike noise, we propose a global-local pooling (GLP) scheme that combines max and average pooling operations. Furthermore, the thresholds of the spiking neurons…
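
As a rough illustration of the idea of combining max and average pooling, the sketch below pools a 1-D sequence over non-overlapping windows and adds the two pooled values together. The abstract states only that GLP combines the two operations; the summation, the window layout, and the function name are assumptions made here for illustration and may differ from the paper's formulation.

```python
def global_local_pool(signal, window):
    """Pool a 1-D sequence by adding max pooling and average pooling.

    Assumption: the two pooled values are summed per non-overlapping
    window. Trailing elements that do not fill a window are dropped.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    pooled = []
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        max_part = max(chunk)            # keeps the strongest response
        avg_part = sum(chunk) / window   # keeps the overall activity level
        pooled.append(max_part + avg_part)
    return pooled


# Hypothetical input values, for illustration only.
print(global_local_pool([1.0, 3.0, 2.0, 2.0], window=2))
```

Max pooling alone preserves peaks but is sensitive to isolated outliers, while average pooling smooths them out; combining the two is one plausible way to keep salient responses while damping noise.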

Taxonomy

Topics: Neural Networks and Reservoir Computing · Advanced Adaptive Filtering Techniques · Blind Source Separation Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Byte Pair Encoding · Layer Normalization · Average Pooling · Spiking Neural Networks · Label Smoothing