MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment
Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi

TL;DR
This paper introduces MobileLLM-Flash, a methodology for designing on-device large language models optimized for low latency and broad hardware compatibility, suitable for industry deployment.
Contribution
It presents a hardware-in-the-loop architecture search approach that jointly optimizes model architecture and attention patterns without specialized mechanisms.
Findings
MobileLLM-Flash models support up to 8k context length.
Achieves up to 1.8x faster prefill and 1.6x faster decode on mobile CPUs.
Provides a Pareto-frontier analysis guiding OD-LLM design.
Abstract
Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like Executorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
