SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition

Jiaqi Wang; Liutao Yu; Xiongri Shen; Sihang Guo; Chenlin Zhou; Leilei Zhao; Yi Zhong; Zhiguo Zhang; Zhengyu Ma

arXiv:2511.07883 · cs.SD · January 21, 2026

Open Access

TL;DR

SpikCommander is a spiking transformer architecture that combines multi-view learning with temporal-aware attention to improve energy-efficient speech command recognition on benchmark datasets.

Contribution

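The paper proposes SpikCommander, a fully spike-driven transformer built from a multi-view spiking temporal-aware self-attention (MSTASA) module and a spiking contextual refinement channel MLP (SCR-MLP), to improve temporal modeling and feature integration in SNN-based speech command recognition.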

Findings

01 Outperforms state-of-the-art SNN methods on the SHD, SSC, and GSC datasets.

02 Uses fewer parameters while maintaining high accuracy.

03 Demonstrates robustness and efficiency in speech command recognition.

Abstract

Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature…
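To make the "binary spike-based representations" and event-driven processing mentioned in the abstract concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron, a common building block of SNNs. It is not the paper's MSTASA or SCR-MLP, and the abstract does not say which neuron model SpikCommander uses; the function name `lif_neuron` and the parameters `tau`, `v_threshold`, and `v_reset` are assumptions chosen for illustration.

```python
def lif_neuron(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    Illustrative only: names and default values are assumptions,
    not taken from the SpikCommander paper.
    Returns a list of binary spikes (0 or 1), one per input step.
    """
    v = v_reset  # membrane potential starts at the reset value
    spikes = []
    for x in inputs:
        # Leaky integration: the potential decays toward v_reset
        # while being driven by the current input.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # emit a binary spike
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)  # stay silent; no event is produced
    return spikes


if __name__ == "__main__":
    # A constant input of 1.5 needs two steps to reach the threshold.
    print(lif_neuron([1.5, 1.5, 1.5, 1.5]))
```

With the defaults above, a constant input of 1.5 charges the neuron for one step and fires on the next, so the output alternates between silence and a spike. Downstream layers only see these 0/1 events, which is why richer temporal context has to come from the architecture rather than from the activations themselves.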

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Hearing Loss and Rehabilitation