Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

Wenrui Li; Penghong Wang; Xingtao Wang; Wangmeng Zuo; Xiaopeng Fan; Yonghong Tian

arXiv:2505.19938·cs.CV·May 27, 2025

Open Access

TL;DR

This paper introduces MDST++, a novel spiking transformer model that decouples semantic and motion information across multiple timescales, improving zero-shot learning for audio-visual video classification by reducing background bias and capturing dynamic motion more effectively.

Contribution

The paper proposes a dual-stream spiking transformer with motion decoupling and adaptive neuron thresholds, enhancing zero-shot learning performance and motion understanding in video data.
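To make "adaptive neuron thresholds" concrete, here is a minimal sketch of a generic leaky integrate-and-fire (LIF) neuron whose firing threshold rises after each spike and then relaxes. This summary does not describe how MDST++ adapts its thresholds, so the update rule, the function name `adaptive_lif`, and every parameter value below are illustrative assumptions, not the paper's mechanism.

```python
def adaptive_lif(inputs, decay=0.9, base_threshold=1.0,
                 adapt_step=0.5, adapt_decay=0.8):
    """Simulate one LIF neuron with a spike-triggered adaptive threshold.

    All parameter names and defaults are hypothetical, chosen for illustration.
    """
    membrane = 0.0    # membrane potential
    adaptation = 0.0  # extra threshold accumulated from recent spikes
    spikes = []
    for current in inputs:
        # Leaky integration of the input current.
        membrane = decay * membrane + current
        threshold = base_threshold + adaptation
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0            # hard reset after a spike
            adaptation += adapt_step  # make the next spike harder to trigger
        else:
            spikes.append(0)
        # The threshold offset relaxes back toward zero over time.
        adaptation *= adapt_decay
    return spikes


if __name__ == "__main__":
    constant_input = [1.2] * 5
    print("adaptive threshold:", adaptive_lif(constant_input))
    print("fixed threshold:   ", adaptive_lif(constant_input, adapt_step=0.0))
```

With a constant input, the fixed-threshold neuron fires on every step, while the adaptive one fires more sparsely because each spike temporarily raises its threshold.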

Findings

01

MDST++ outperforms state-of-the-art methods on benchmarks.

02

Incorporating motion and multi-timescale information improves harmonic mean (HM) and ZSL accuracy by 26.2% and 39.9%, respectively.

03

Adaptive thresholding improves temporal and motion cue extraction.
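The HM figure in the second finding is, by the usual convention in generalized zero-shot learning, the harmonic mean of seen-class and unseen-class accuracy. The sketch below assumes that convention applies here; the function name and the example accuracies are made up for illustration and are not results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    Returns 0.0 when both accuracies are zero to avoid division by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


if __name__ == "__main__":
    # Hypothetical accuracies, not taken from the paper.
    print(round(harmonic_mean(0.6, 0.3), 4))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot score well on HM by doing well only on seen classes.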

Abstract

Audio-visual zero-shot learning (ZSL) has been extensively researched for its capability to classify video data from classes unseen during training. Nevertheless, current methodologies often struggle with background scene biases and inadequate motion detail. This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++), which decouples contextual semantic information and sparse dynamic motion information. A recurrent joint learning unit is proposed to extract contextual semantic information and capture joint knowledge across modalities to understand the environment of actions. By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases. Moreover, we introduce a discrepancy analysis block to model audio motion information. To enhance the robustness of spiking neural networks (SNNs) in extracting…
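The abstract says RGB images are converted to events but, in the portion shown here, not how. One common approximation used by event-camera simulators is to threshold the per-pixel change in log intensity between consecutive frames. The sketch below assumes that approach, grayscale frames with values in [0, 1], and a hypothetical function name and threshold; it is not the paper's conversion pipeline.

```python
import math


def frames_to_events(prev_frame, next_frame, threshold=0.2, eps=1e-3):
    """Return a per-pixel polarity map: +1 (brighter), -1 (darker), 0 (no event).

    Frames are lists of rows of grayscale intensities in [0, 1].
    `threshold` and `eps` are illustrative values, not taken from the paper.
    """
    events = []
    for row_prev, row_next in zip(prev_frame, next_frame):
        row_events = []
        for p, n in zip(row_prev, row_next):
            # Change in log intensity; eps keeps log() defined at zero.
            delta = math.log(n + eps) - math.log(p + eps)
            if delta >= threshold:
                row_events.append(1)
            elif delta <= -threshold:
                row_events.append(-1)
            else:
                row_events.append(0)
        events.append(row_events)
    return events


if __name__ == "__main__":
    prev_frame = [[0.5, 0.5], [0.5, 0.5]]
    next_frame = [[0.5, 0.9], [0.1, 0.5]]  # only two pixels change
    for row in frames_to_events(prev_frame, next_frame):
        print(row)
```

Pixels that do not change produce no events, which illustrates why an event representation carries little static background and is dominated by motion.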

Taxonomy

Topics: Neural Networks and Reservoir Computing · Speech and Audio Processing · Hearing Loss and Rehabilitation

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing