HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture
Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng, Shuqi Wang, Wendong Xu, Chufan Shi, Zhengwu Liu, Ngai Wong

TL;DR
HaLoRA is a novel hardware-aware low-rank adaptation method that enables energy-efficient deployment of large language models on hybrid compute-in-memory architectures while maintaining high accuracy.
Contribution
The paper introduces HaLoRA, a robust training approach for LoRA branches on CIM architectures, reducing energy costs significantly with minimal accuracy loss.
Findings
Achieves up to 22.7% improvement in average score on reasoning tasks.
Reduces energy cost to about 3% of Nvidia A100 GPU.
Maintains robustness across various noise types and levels.
Abstract
Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method for adapting large language models (LLMs) to downstream tasks. Meanwhile, Compute-in-Memory (CIM) architectures demonstrate superior energy efficiency due to their array-level parallel in-memory computing designs. In this paper, we propose deploying the LoRA-finetuned LLMs on the hybrid CIM architecture (i.e., pretrained weights onto energy-efficient Resistive Random-Access Memory (RRAM) and LoRA branches onto noise-free Static Random-Access Memory (SRAM)), reducing the energy cost to about 3\% compared to the Nvidia A100 GPU. However, the inherent noise of RRAM on the saved weights leads to performance degradation, simultaneously. To address this issue, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method. The key insight is to train a LoRA branch that is robust toward such noise and then…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
MethodsLLaMA
