zFLoRA: Zero-Latency Fused Low-Rank Adapters
Dhananjaya Gowda, Seoha Song, Harshith Goka, Junhyun Lee

TL;DR
zFLoRA is a low-rank adapter method for large language models that adds zero or negligible latency overhead at inference across hardware platforms, while comparing favorably against LoRA and full fine-tuning on multiple tasks.
Contribution
The paper introduces zFLoRA, a zero-latency fused low-rank adapter that adds zero or negligible inference latency on top of the base model while maintaining competitive performance.
Findings
zFLoRA achieves zero or negligible latency overhead on NPU and GPU platforms.
Experimental results on LLMs of size 1B, 3B and 7B show that zFLoRA compares favorably against LoRA and full fine-tuning (FFT) across 18 tasks.
zFLoRA maintains high accuracy across diverse reasoning and dialogue tasks.
Abstract
Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with the apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant at inference time (up to 2.5x that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B and 7B show that zFLoRA compares favorably against the popular supervised fine-tuning benchmarks, including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three different categories, namely commonsense reasoning, math reasoning and…
