MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases
Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese

TL;DR
MobileAIBench is a benchmarking framework designed to evaluate the performance, resource consumption, and safety of quantized LLMs and LMMs on mobile devices, addressing the need for systematic testing tools in this domain.
Contribution
It introduces a comprehensive, open-source benchmarking framework for assessing mobile-optimized LLMs and LMMs across various tasks, sizes, and quantization levels.
Findings
Quantization impacts task performance and safety metrics.
MobileAIBench enables systematic evaluation of latency and resource use.
The framework supports on-device testing with real hardware.
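To make the first finding concrete, the following is a minimal sketch of why a lower bit width loses more information. It assumes plain symmetric uniform quantization applied to synthetic Gaussian weights; this is not necessarily the quantization scheme evaluated in the paper, and the helper names `quantize_dequantize` and `mean_abs_error` are hypothetical, introduced only for illustration.

```python
import random


def quantize_dequantize(weights, bits):
    """Round weights to signed `bits`-bit integers (symmetric, one scale per list), then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # largest representable level, e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero for all-zero input
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute difference between the original and round-tripped values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
# Synthetic stand-in for a weight tensor (assumption: not taken from any real model).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {
    bits: mean_abs_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4, 3)
}
for bits, err in errors.items():
    print(f"{bits}-bit mean absolute round-trip error: {err:.6f}")
```

The round-trip error grows as the bit width shrinks, which is the kind of effect a benchmark would then need to relate to downstream task and safety metrics. How that weight-level error translates into task performance is an empirical question that this sketch does not answer.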
Abstract
The deployment of Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices has gained significant attention due to the benefits of enhanced privacy, stability, and personalization. However, the hardware constraints of mobile devices necessitate the use of models with fewer parameters and model compression techniques like quantization. Currently, there is limited understanding of quantization's impact on various task performances, including LLM tasks, LMM tasks, and, critically, trust and safety. There is a lack of adequate tools for systematically testing these models on mobile devices. To address these gaps, we introduce MobileAIBench, a comprehensive benchmarking framework for evaluating mobile-optimized LLMs and LMMs. MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real…
Peer Reviews
Decision·Submitted to ICLR 2025
- Exciting new topic. Running LLMs (and possibly LMMs) on-device is important for enhanced privacy, and, under certain conditions, it provides enhanced performance and UX.
- I appreciate the extensive evaluation metrics, on top of the standard performance utilization: trust and safety, but also the qualitative metrics under a wide range of NLP/LMM tasks.
- Good quality of writing, figures etc. I do have a few suggestions for further improvements, mentioned next.
- Missing important related works:
  * MELTing point: Mobile Evaluation of Language Transformers, from Laskaridis et al.
  * Small Language Models: Survey, Measurements, and Insights, from Zhenyan Lu et al.
- Authors are claiming that this is the first work "to provide a thorough benchmarking and analysis of open source LLMs". I suggest they reduce such strong claims.
- I would expect from a paper like this to list (if not include in the evaluation) the available on-device frameworks instead of
a) Real-world experiments.
b) By measuring latency and hardware resource usage on the iPhone 14, the study provides insights into the performance of smaller models.
c) The paper introduces an open-source tool that facilitates convenient testing of small models.
d) The writing is clear and fluent.
a) The experiments were conducted only on the iPhone 14, lacking evaluations on newer and more diverse devices. Currently, there are more mobile devices optimized specifically for on-device AI, such as the Snapdragon 8 Gen 3. Including these devices in testing would provide a more comprehensive view of model performance under different hardware conditions, offering broader insights for on-device AI applications.
b) In Section 4.3, the number of models tested is limited, failing to cover a w
S1 - This work makes a valuable contribution by expanding the community's understanding of models’ behaviors with quantization. This offers several analyses that will be beneficial for further research in this area, especially when deploying the model on mobile devices.
S2 - The open-sourced experimental platform is highly meaningful, allowing others to reproduce the work easily.
S3 - For analyses, the authors have taken a comprehensive approach by considering various evaluation axes. For exa
W1 - The current set of tasks can be limited. With the growing interest in UI-based control for digital devices (such as Claude 3.5 for computer use), it would be beneficial to include related tasks. Have the authors considered incorporating AndroidWorld (Rawles et al., 2024) for general capability assessment or MobileSafetyBench (Lee et al., 2024) for evaluating the safety of agents controlling mobile devices?
W2 - Relying solely on VQA for multimodal tasks may restrict the scope of analysis.
Taxonomy
Topics: Semantic Web and Ontologies · Digital Rights Management and Security · Business Process Modeling and Analysis
MethodsLib
