SPARTAN: Sparse Hierarchical Memory for Parameter-Efficient Transformers
Ameet Deshpande, Md Arafat Sultan, Anthony Ferritto, Ashwin Kalyan, Karthik Narasimhan, Avirup Sil

TL;DR
SPARTAN introduces a hierarchical sparse memory architecture for transformers that enables efficient fine-tuning on edge devices by updating only the memory components, significantly reducing storage costs and increasing inference speed.
Contribution
It proposes a novel hierarchical sparse memory design that allows parameter-efficient fine-tuning of pre-trained language models on edge devices, with faster inference than existing parameter-efficient methods and comparable accuracy.
Findings
Over 90% inference speedup on Raspberry Pi 4
Outperforms parameter-efficient (PE) baselines by 0.1 points on GLUE
Trains 34% faster in few-shot settings
Abstract
Fine-tuning pre-trained language models (PLMs) achieves impressive performance on a range of downstream tasks, and their sizes have consequently been getting bigger. Since a different copy of the model is required for each task, this paradigm is infeasible for storage-constrained edge devices like mobile phones. In this paper, we propose SPARTAN, a parameter efficient (PE) and computationally fast architecture for edge devices that adds hierarchically organized sparse memory after each Transformer layer. SPARTAN freezes the PLM parameters and fine-tunes only its memory, thus significantly reducing storage costs by re-using the PLM backbone for different tasks. SPARTAN contains two levels of memory, with only a sparse subset of parents being chosen in the first level for each input, and children cells corresponding to those parents being used to compute an output representation. This…
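The two-level lookup described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `hierarchical_memory_lookup`, the `top_k` parameter, the dot-product scoring, and the softmax-weighted sum over child values are all assumptions made for illustration. The abstract only states that a sparse subset of parents is chosen and that the children of those parents are used to compute an output representation.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def hierarchical_memory_lookup(h, parent_keys, child_keys, child_values, top_k):
    """Hypothetical two-level sparse memory read for one hidden vector.

    h:            (d,)       hidden state from a frozen Transformer layer
    parent_keys:  (P, d)     one key per parent cell (level 1)
    child_keys:   (P, C, d)  C child keys under each parent (level 2)
    child_values: (P, C, d)  C child values under each parent
    top_k:        number of parents kept for this input (assumed knob)
    """
    # Level 1: score every parent, keep only the top_k highest-scoring ones.
    parent_scores = parent_keys @ h
    selected = np.argsort(parent_scores)[-top_k:]

    # Level 2: only children of the selected parents take part.
    d = h.shape[0]
    keys = child_keys[selected].reshape(-1, d)
    values = child_values[selected].reshape(-1, child_values.shape[-1])

    # Assumed combination rule: softmax attention over the surviving children.
    weights = softmax(keys @ h)
    return weights @ values, selected


# Usage with random memory; in the setting the abstract describes, only the
# memory arrays would be trained while the backbone stays frozen.
rng = np.random.default_rng(0)
d, P, C = 8, 4, 3
h = rng.normal(size=d)
out, selected = hierarchical_memory_lookup(
    h,
    rng.normal(size=(P, d)),
    rng.normal(size=(P, C, d)),
    rng.normal(size=(P, C, d)),
    top_k=2,
)
print(out.shape, sorted(selected.tolist()))
```

In this sketch, only `top_k * C` child cells are scored per input instead of all `P * C`, which is the kind of saving that sparse parent selection is meant to provide.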
Taxonomy
Topics: Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Adam · Absolute Position Encodings · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing
