SpikeBERT: A Language Spikformer Learned from BERT with Knowledge Distillation
Changze Lv, Tianlong Li, Jianhan Xu, Chenxi Gu, Zixuan Ling, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang

TL;DR
SpikeBERT introduces a deep spiking Transformer for language tasks, trained via a two-stage knowledge distillation from BERT, achieving competitive accuracy with lower energy consumption.
Contribution
The paper develops SpikeBERT, a deep spiking Transformer for language understanding. It adapts Spikformer to process text and trains it with two-stage knowledge distillation from BERT: pre-training on unlabelled text, followed by task-specific fine-tuning.
Findings
SpikeBERT outperforms existing SNNs on text classification.
SpikeBERT achieves comparable results to BERT with less energy.
Two-stage knowledge distillation improves SNN training for language tasks.
Abstract
Spiking neural networks (SNNs) offer a promising avenue to implement deep neural networks in a more energy-efficient way. However, the network architectures of existing SNNs for language tasks are still simplistic and relatively shallow, and deep architectures have not been fully explored, resulting in a significant performance gap compared to mainstream transformer-based networks such as BERT. To this end, we improve a recently-proposed spiking Transformer (i.e., Spikformer) to make it possible to process language tasks and propose a two-stage knowledge distillation method for training it, which combines pre-training by distilling knowledge from BERT with a large collection of unlabelled texts and fine-tuning with task-specific instances via knowledge distillation again from the BERT fine-tuned on the same training examples. Through extensive experimentation, we show that the models…
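The fine-tuning stage described in the abstract distils knowledge from a fine-tuned BERT teacher into the spiking student. The excerpt does not give the loss, so the sketch below is only a generic soft-label distillation objective (temperature-scaled KL term plus a cross-entropy term). It is not the paper's exact formulation. The function names, `temperature`, and `alpha` are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical combined loss: alpha * hard-label CE + (1 - alpha) * soft-label KL.

    `alpha` and `temperature` are illustrative hyperparameters, not values
    taken from the paper.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor is the usual rescaling of the soft term's gradients.
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    # Standard cross-entropy of the student against the gold label.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * hard_term + (1.0 - alpha) * soft_term


if __name__ == "__main__":
    teacher = [0.2, 2.5]   # made-up teacher logits for a 2-class task
    student = [0.9, 1.1]   # made-up student logits
    print(distillation_loss(student, teacher, label=1))
```

For the pre-training stage on unlabelled text there are no gold labels, so under the same assumptions only a teacher-matching term would apply (for example, `alpha=0.0` here).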
Taxonomy
Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Robotics and Automated Systems
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Transformer · Linear Layer · Layer Normalization · Dropout
