IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing
Zeyang Song, Shimin Zhang, Yuhong Chou, Jibin Wu, Haizhou Li

TL;DR
The paper introduces IML-Spikeformer, a novel spiking Transformer architecture for large-scale speech processing that improves performance and energy efficiency by simulating multi-timestep spike firing within a single timestep.
Contribution
The paper proposes the Input-aware Multi-Level Spike (IMLS) mechanism and a re-parameterized self-attention module with a hierarchical decay mask, advancing scalable SNN architectures for speech tasks.
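The idea of producing a multi-timestep spike count in a single timestep can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a soft-reset integrate-and-fire neuron driven by a constant input, a fixed `threshold`, and hypothetical function names (`multi_step_spike_count`, `multi_level_spike`). How IML-Spikeformer makes the firing "input-aware" is not described in this summary, so the threshold here is simply passed in by hand.

```python
import numpy as np


def multi_step_spike_count(x, threshold, num_steps):
    """Simulate a soft-reset integrate-and-fire neuron for num_steps steps.

    The input x is held constant across steps (an assumption of this sketch).
    Returns the total number of spikes emitted per element.
    """
    x = np.asarray(x, dtype=np.float64)
    v = np.zeros_like(x)          # membrane potential
    count = np.zeros_like(x)      # accumulated spike count
    for _ in range(num_steps):
        v = v + x                 # integrate the input
        fired = v >= threshold    # at most one binary spike per step
        count = count + fired
        v = v - fired * threshold # soft reset: subtract the threshold
    return count


def multi_level_spike(x, threshold, num_levels):
    """Compute the same spike count in one step as an integer-valued spike.

    Under the assumptions above, the count equals
    clip(floor(num_levels * x / threshold), 0, num_levels).
    """
    x = np.asarray(x, dtype=np.float64)
    return np.clip(np.floor(num_levels * x / threshold), 0, num_levels)


# Inputs chosen to be exactly representable in binary floating point.
inputs = np.array([-0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 2.0])
simulated = multi_step_spike_count(inputs, threshold=1.0, num_steps=4)
single_step = multi_level_spike(inputs, threshold=1.0, num_levels=4)
print(simulated)
print(single_step)
```

In this toy setting the loop over four timesteps and the single clipped-floor evaluation give the same counts, which is the intuition behind replacing multi-timestep firing with one multi-level spike. The equivalence shown holds only for the constant-input, soft-reset case assumed here.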
Findings
Achieves competitive word error rates on AiShell-1 and Librispeech-960.
Cuts inference energy consumption by more than a factor of four.
Demonstrates scalable performance of SNNs in large-scale speech processing.
Abstract
Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome the issues, we introduce Input-aware Multi-Level Spikeformer, i.e. IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
