Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
Sravanth Kodavanti, Sowmya Vajrala, Srinivas Miriyala, Utsav Tiwari, Uttam Kumar, Utkarsh Kumar Mahawar, Achal Pratap Singh, Arya D, Narendra Mutyala, Vikram Nelvoy Rajendiran, Sharan Kumar Allur, Euntaik Lee, Dohyoung Kim, HyeonSu Lee, Gyusung Cho, JungBae Kim

TL;DR
This paper presents a hardware-aware framework for efficient on-device deployment of a multilingual, LLaMA-based LLM on smartphones. It reports 4-6x reductions in memory and latency, and gains runtime flexibility through application-specific LoRAs supplied as runtime inputs, multi-stream decoding, and Dynamic Self-Speculative Decoding (DS2D).
Contribution
It introduces an on-device inference system, optimized for mobile hardware, that serves multiple use cases from a single frozen inference graph. The system supports dynamic task switching, multi-stream decoding, and accelerated token generation.
Findings
Achieves 4-6x reduction in memory and latency on mobile devices.
Reduces decoding time by up to 6x with multi-stream decoding.
Maintains accuracy across 9 languages and 8 tasks.
Abstract
Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with SM8650 and SM8750 Qualcomm chipsets respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations - such as formal, polite, or jovial responses - within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D),…
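The idea of passing LoRAs as runtime inputs to a single frozen graph can be illustrated with a toy linear layer. This is a minimal sketch, not the authors' implementation: the function names, the task names ("summarize", "translate"), the tiny 2x2 weight, and the rank-1 adapters are all assumptions made for illustration. The point it shows is that the base weight never changes, and switching tasks only swaps the low-rank matrices fed into the same function.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_linear(x: Vector, w_frozen: Matrix, lora_a: Matrix,
                lora_b: Matrix, scale: float = 1.0) -> Vector:
    """Compute y = W x + scale * B (A x).

    w_frozen stands in for the frozen graph's weight; lora_a and lora_b
    are ordinary inputs, so changing tasks needs no change to this function.
    """
    base = matvec(w_frozen, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, low_rank)]


# Hypothetical frozen base weight (identity, for easy checking).
W: Matrix = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical per-task rank-1 adapters: A is 1x2, B is 2x1.
ADAPTERS: Dict[str, Dict[str, Matrix]] = {
    "summarize": {"a": [[1.0, 1.0]], "b": [[0.5], [0.0]]},
    "translate": {"a": [[1.0, -1.0]], "b": [[0.0], [2.0]]},
}


def run_task(task: str, x: Vector) -> Vector:
    """Select an adapter at call time and run the same frozen layer."""
    adapter = ADAPTERS[task]
    return lora_linear(x, W, adapter["a"], adapter["b"])


out_summarize = run_task("summarize", [1.0, 2.0])
out_translate = run_task("translate", [1.0, 2.0])
```

Both calls go through the same `lora_linear` with the same `W`; only the adapter arguments differ, which is the property that lets a compiled graph stay fixed while tasks change.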
