QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
Rachmad Vidya Wicaksana Putra, Pasindu Wickramasinghe, Muhammad Shafique

TL;DR
QSLM is an automated quantization framework for spike-driven language models (SLMs). It reduces memory footprint by up to 86.5% and power consumption by up to 20% while keeping task performance close to that of the original models, enabling efficient deployment on resource-constrained devices.
Contribution
QSLM introduces a tiered quantization search strategy, combined with multi-objective optimization, to compress pre-trained SLMs efficiently under performance and memory constraints.
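As a rough illustration of what selecting a quantization setting under performance and memory constraints can look like, the sketch below runs a plain exhaustive search over per-group bit-widths and scores each feasible setting on a weighted trade-off between accuracy and memory saving. It is not the paper's tiered search, whose details are not given on this page. The layer groups, parameter counts, candidate bit-widths, the `toy_accuracy` stand-in, and the scoring weight are all hypothetical values chosen for illustration.

```python
from itertools import product

# Hypothetical layer groups and parameter counts (illustrative, not from the paper).
PARAM_COUNTS = {"embedding": 1_000_000, "attention": 4_000_000, "mlp": 8_000_000}
BASELINE_BITS = 32            # assumed full-precision weight width
CANDIDATE_BITS = (4, 8, 16)   # assumed candidate bit-widths per group


def memory_bytes(bits_per_group):
    """Weight memory in bytes for a per-group bit-width assignment."""
    return sum(PARAM_COUNTS[g] * b for g, b in bits_per_group.items()) // 8


def toy_accuracy(bits_per_group):
    """Stand-in for a real evaluation run: accuracy falls as precision falls."""
    penalty = sum(0.02 * (16 - b) / 12 for b in bits_per_group.values())
    return 0.90 - penalty


def search(max_bytes, min_accuracy, weight=0.5):
    """Return the best-scoring setting that meets both constraints, or None."""
    baseline = memory_bytes({g: BASELINE_BITS for g in PARAM_COUNTS})
    best, best_score = None, float("-inf")
    for combo in product(CANDIDATE_BITS, repeat=len(PARAM_COUNTS)):
        setting = dict(zip(PARAM_COUNTS, combo))
        mem, acc = memory_bytes(setting), toy_accuracy(setting)
        if mem > max_bytes or acc < min_accuracy:
            continue  # violates the memory or the accuracy constraint
        saving = 1 - mem / baseline
        score = weight * acc + (1 - weight) * saving  # multi-objective trade-off
        if score > best_score:
            best, best_score = setting, score
    return best


if __name__ == "__main__":
    print(search(max_bytes=60_000_000, min_accuracy=0.0))
```

In a real setting, `toy_accuracy` would be replaced by an evaluation of the quantized model on the target task. Exhaustive enumeration grows exponentially with the number of groups, which is the kind of cost a tiered search is meant to avoid.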
Findings
Memory footprint reduced by up to 86.5%.
Power consumption decreased by up to 20%.
Maintains high accuracy close to original models.
Abstract
Large Language Models (LLMs) have emerged as prominent AI models for solving many natural language tasks, owing to their high performance (e.g., accuracy) and their ability to generate high-quality responses to given inputs. However, their large computational cost, huge memory footprint, and high processing power/energy make embedded deployment challenging. Among several tinyLLMs, recent works have proposed spike-driven language models (SLMs) to significantly reduce the processing power/energy of LLMs. However, their memory footprints remain too large for low-cost, resource-constrained embedded devices. A manual quantization approach may effectively compress SLM memory footprints, but it requires substantial design time and compute power to find the quantization setting for each network, which makes this approach not scalable for handling different networks,…
