MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment

Hanxian Huang; Igor Fedorov; Andrey Gromov; Bernard Beckerman; Naveen Suda; David Eriksson; Maximilian Balandat; Rylan Conway; Patrick Huber; Chinnadhurai Sankar; Ayushi Dalmia; Zechun Liu; Lemeng Wu; Tarek Elgamal; Adithya Sagar; Vikas Chandra; Raghuraman Krishnamoorthi

arXiv:2603.15954·cs.LG·April 29, 2026

TL;DR

This paper introduces MobileLLM-Flash, a methodology for designing on-device large language models optimized for low latency and broad hardware compatibility, suitable for industry deployment.

Contribution

The paper presents a hardware-in-the-loop architecture search that jointly optimizes model architecture and attention pattern without relying on specialized attention mechanisms.

Findings

01

MobileLLM-Flash models support context lengths of up to 8k tokens.

02

The models achieve up to 1.8x faster prefill and 1.6x faster decode on mobile CPUs.

03

The paper provides a Pareto-frontier analysis to guide OD-LLM design.

Abstract

Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This methodology is amenable to industry-scale deployment: it generates models that are deployable without custom kernels and compatible with standard mobile runtimes such as ExecuTorch. It avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited…
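
As a rough illustration of the latency-guided selection the abstract describes, the sketch below computes a Pareto frontier over (latency, quality) pairs and picks the best candidate under a latency budget. It is not the authors' search procedure. The `Candidate` type, the function names, and every number are hypothetical placeholders; in a hardware-in-the-loop setting the latencies would come from measurements on the target device.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    latency_ms: float  # hypothetical on-device latency, lower is better
    loss: float        # hypothetical quality proxy, lower is better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that no other candidate beats on both axes."""
    frontier = []
    best_loss = float("inf")
    # Sweep from fastest to slowest; keep a candidate only if it improves loss.
    for cand in sorted(candidates, key=lambda c: (c.latency_ms, c.loss)):
        if cand.loss < best_loss:
            frontier.append(cand)
            best_loss = cand.loss
    return frontier


def best_under_budget(candidates: list[Candidate],
                      budget_ms: float) -> Optional[Candidate]:
    """Pick the lowest-loss frontier candidate within the latency budget."""
    feasible = [c for c in pareto_frontier(candidates)
                if c.latency_ms <= budget_ms]
    return min(feasible, key=lambda c: c.loss) if feasible else None


# Made-up values for illustration only; they are not results from the paper.
candidates = [
    Candidate("a", 100.0, 3.0),
    Candidate("b", 80.0, 3.2),
    Candidate("c", 120.0, 3.1),  # dominated by "a"
    Candidate("d", 60.0, 3.5),
    Candidate("e", 90.0, 3.3),   # dominated by "b"
]

frontier_names = [c.name for c in pareto_frontier(candidates)]
chosen = best_under_budget(candidates, budget_ms=95.0)
```

With these placeholder values, the frontier keeps only the candidates that trade latency for quality without being dominated, and tightening `budget_ms` moves the choice toward faster, lower-quality candidates.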
