Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation
Irene Tenison, Stella Ahn, Miriam Kim, Ebtisam Alshehri, Lalana Kagal

TL;DR
This paper introduces LARS, a fine-tuning method for large language models that reduces training memory on both GPUs and CPUs, enabling on-device adaptation while keeping accuracy competitive.
Contribution
LARS decouples memory consumption from sequence length by constraining activation subspaces, improving memory efficiency over existing PEFT methods like LoRA.
Findings
LARS reduces memory footprint by 33.54% on GPUs and 51.95% on CPUs compared to LoRA.
LARS maintains competitive accuracy and throughput across multiple datasets and models.
LARS enables scalable LLM personalization on resource-constrained devices like Raspberry Pi.
Abstract
Parameter-Efficient Fine-Tuning (PEFT) has become the standard for adapting large language models (LLMs). In this work, we challenge the widespread assumption that parameter efficiency equates to memory efficiency and on-device adaptability. We show that it does not: while methods like LoRA and IA3 significantly reduce trainable parameters, they remain bound by intermediate tensors that scale linearly with sequence length, often triggering out-of-memory errors on-device. We introduce LARS (Low-memory Activation-Rank Subspace), a novel adaptation framework that decouples memory consumption from sequence length. While prior PEFT methods apply low-rank constraints to model parameters, LARS instead constrains the activation subspace used during training, directly targeting the dominant source of memory consumption and fundamentally flattening the memory growth rate. LARS…
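The gap between parameter count and memory described in the abstract can be illustrated with a back-of-the-envelope estimate for a single LoRA-adapted linear layer. The sketch below is not the paper's method or measurement: the class name `LoRALayer`, the helper `to_mib`, the layer dimensions, the rank, and the 2-byte element size are all illustrative assumptions. It models only the tensors that standard backpropagation keeps for the two adapter matrices (the layer input and its low-rank projection), and ignores attention, normalization layers, and framework-specific caching.

```python
from dataclasses import dataclass


@dataclass
class LoRALayer:
    """Hypothetical single linear layer with a LoRA adapter (illustrative only)."""
    d_in: int   # input feature dimension of the adapted layer
    d_out: int  # output feature dimension of the adapted layer
    rank: int   # LoRA rank r

    def trainable_params(self) -> int:
        # A has shape (rank, d_in) and B has shape (d_out, rank).
        # This count does not depend on batch size or sequence length.
        return self.rank * (self.d_in + self.d_out)

    def saved_activation_elems(self, batch: int, seq_len: int) -> int:
        # The gradient of A needs the layer input x: (batch, seq_len, d_in).
        # The gradient of B needs the projection x @ A.T: (batch, seq_len, rank).
        # Both are kept from the forward pass, so the total grows with seq_len.
        return batch * seq_len * (self.d_in + self.rank)


def to_mib(num_elems: int, bytes_per_elem: int = 2) -> float:
    # Assumes 2-byte elements (e.g. fp16/bf16); change for other dtypes.
    return num_elems * bytes_per_elem / 2**20


# Assumed dimensions, chosen for illustration and not taken from the paper.
layer = LoRALayer(d_in=4096, d_out=4096, rank=8)

for seq_len in (512, 1024, 2048):
    params_mib = to_mib(layer.trainable_params())
    acts_mib = to_mib(layer.saved_activation_elems(batch=1, seq_len=seq_len))
    print(f"seq_len={seq_len:5d}  adapter params: {params_mib:.3f} MiB  "
          f"saved activations: {acts_mib:.3f} MiB")
```

Under these assumptions the adapter parameter memory stays fixed while the saved-activation term doubles each time the sequence length doubles, which is the linear scaling the abstract attributes to LoRA-style methods.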
