Sparse High Rank Adapters
Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Shreya Kadambi, Rafael Esteves, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

TL;DR
SHiRA introduces a sparse, high-rank adapter method that enables rapid model switching, reduces concept-loss, and maintains low inference overhead by fine-tuning only 1-2% of model weights, outperforming LoRA.
Contribution
The paper proposes SHiRA, a novel sparse adapter that allows fast switching and multi-adapter fusion with minimal parameter tuning, improving over LoRA in efficiency and concept retention.
Findings
SHiRA achieves up to 16x faster loading than LoRA on CPU.
Fine-tuning 1-2% of parameters yields better performance than LoRA.
SHiRA reduces concept-loss in multi-adapter scenarios.
Abstract
Low Rank Adaptation (LoRA) has gained massive attention in recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept-loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged. This results in a highly sparse adapter…
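The idea of tuning only 1-2% of the base weights and shipping the result as a sparse adapter can be illustrated with a small NumPy sketch. This is not the authors' implementation: the random mask selection, the helper names (`make_sparse_mask`, `extract_adapter`, `apply_adapter`), and the 64x64 toy weight matrix are all assumptions made for illustration. The sketch stores an adapter as a list of flat indices plus delta values, writes it into the weights in place, and reverts it by subtracting the same values.

```python
import numpy as np

def make_sparse_mask(shape, fraction, rng):
    """Choose a random `fraction` of weight positions as trainable.
    Random selection is an assumption; the source does not specify the mask."""
    n_total = int(np.prod(shape))
    n_train = max(1, int(round(fraction * n_total)))
    flat_idx = rng.choice(n_total, size=n_train, replace=False)
    return np.sort(flat_idx)

def extract_adapter(base, tuned, flat_idx):
    """Keep only the weight deltas at the trainable positions."""
    return (tuned.ravel()[flat_idx] - base.ravel()[flat_idx]).copy()

def apply_adapter(weights, flat_idx, values, sign=1.0):
    """Scatter-add the sparse delta in place; sign=-1.0 reverts it."""
    flat = weights.reshape(-1)  # a view, since `weights` is contiguous
    flat[flat_idx] += sign * values
    return weights

rng = np.random.default_rng(0)
base = rng.standard_normal((64, 64)).astype(np.float32)
idx = make_sparse_mask(base.shape, 0.02, rng)  # about 2% of the entries

# Stand-in for fine-tuning: only the masked positions change.
tuned = base.copy()
tuned.reshape(-1)[idx] += rng.standard_normal(idx.size).astype(np.float32)

values = extract_adapter(base, tuned, idx)

# Switch the adapter on, then off again, touching only the sparse positions.
deployed = base.copy()
apply_adapter(deployed, idx, values)
switched_ok = np.allclose(deployed, tuned)
apply_adapter(deployed, idx, values, sign=-1.0)
reverted_ok = np.allclose(deployed, base, atol=1e-6)
```

Because the adapted weights have the same shape as the base weights, inference runs exactly as it would on the base model, and switching adapters only overwrites the small set of indexed entries.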
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Neural Networks and Reservoir Computing
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Lib · Adapter · Balanced Selection
