Exploring Extreme Quantization in Spiking Language Models
Malyaban Bal, Yi Jiang, Abhronil Sengupta

TL;DR
This paper proposes a binary/ternary (1/1.58-bit) weight-quantized spiking language model, trained by knowledge distillation from a full-precision non-spiking teacher, with the goal of reducing energy consumption while retaining task performance.
Contribution
The paper develops the first 1/1.58-bit (binary/ternary weight) spiking language model and uses knowledge distillation from a full-precision teacher to scale the architecture, advancing energy-efficient NLP models.
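To make the "1/1.58-bit" terminology concrete: a binary weight takes one of two values (1 bit), while a ternary weight takes one of three values, which carries log2(3) ≈ 1.58 bits. The following is a minimal sketch of such quantizers, assuming a per-tensor absolute-mean scale. That scaling rule and the function names `ternary_quantize` and `binary_quantize` are illustrative assumptions, not the quantizer described in the paper.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one shared scale.

    Assumption: the scale is the mean absolute weight; the paper's
    actual quantizer may differ.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def binary_quantize(weights):
    """Map float weights to {-1, +1} plus one shared scale (1 bit each)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [1 if w >= 0 else -1 for w in weights]
    return quantized, scale


# Information content of one ternary weight: log2(3), roughly 1.58 bits.
bits_per_ternary_weight = math.log2(3)
```

A dequantized weight is approximated as `scale * q`, so multiplying by a quantized weight matrix reduces to additions, subtractions, and a single rescale.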
Findings
Achieves competitive performance on GLUE benchmark tasks.
Demonstrates effective knowledge transfer from full-precision models.
Presents a scalable, ultra-quantized spiking LM architecture.
Abstract
Despite the growing prevalence of large language model (LLM) architectures, a crucial concern persists regarding their energy and power consumption, which still lags far behind the remarkable energy efficiency of the human brain. Recent strides in spiking language models (LM) and transformer architectures aim to address this concern by harnessing the spiking activity of biological neurons to enhance energy/power efficiency. Doubling down on the principles of model quantization and energy efficiency, this paper proposes the development of a novel binary/ternary (1/1.58-bit) spiking LM architecture. Achieving scalability comparable to a deep spiking LM architecture is facilitated by an efficient knowledge distillation technique, wherein knowledge from a non-spiking full-precision "teacher" model is transferred to an extremely weight quantized spiking "student" LM. Our proposed model…
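The teacher-to-student transfer mentioned in the abstract can be illustrated with a generic logit-matching distillation loss. This is a sketch under stated assumptions: it uses the common temperature-softened KL divergence between teacher and student output distributions, which may not be the specific distillation objective the paper uses, and the names `softmax` and `distillation_loss` are hypothetical helpers.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: generic logit distillation, not necessarily the
    paper's objective.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, so minimizing it pulls the quantized student toward the full-precision teacher's behavior.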
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · DNA and Biological Computing · Neural Networks and Applications
Methods: Knowledge Distillation
