Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective

Afsara Benazir; Felix Xiaozhu Lin

arXiv:2508.08531 · cs.PF · August 13, 2025

TL;DR

This paper provides a comprehensive analysis of Apple Silicon's architecture and performance for large language model inference, comparing it with NVIDIA GPUs and debunking common misconceptions about hardware efficiency and quantization benefits.

Contribution

It offers the first thorough characterization of Apple Silicon for on-device LLM inference, including benchmarking, profiling, and analysis of hardware bottlenecks and resource utilization.

Findings

01. Apple Silicon's unified memory enhances cost-effectiveness and efficiency for large models.

02. Quantization does not universally guarantee faster inference across all hardware.

03. Performance bottlenecks include dequantization overhead and memory bandwidth limitations.
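
The interplay between the second and third findings can be illustrated with a back-of-the-envelope estimate. The sketch below is not the paper's methodology. It assumes token-by-token decoding is memory-bound, that every weight is read once per generated token, and that KV-cache traffic is negligible. The function name, the `dequant_overhead` parameter, and the example numbers (an 8B-parameter model, 400 GB/s of bandwidth, 25% overhead) are hypothetical and chosen only for illustration.

```python
def decode_tokens_per_second(params_billion, bits_per_weight,
                             bandwidth_gb_s, dequant_overhead=0.0):
    """Rough upper bound on decode throughput for a memory-bound model.

    Assumptions (illustrative, not taken from the paper):
      - all weights are streamed from memory once per generated token
      - KV-cache and activation traffic are ignored
      - dequant_overhead is a fractional time penalty per token
        (0.25 means dequantization adds 25% to the per-token time)
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    seconds_per_token = weight_bytes / (bandwidth_gb_s * 1e9)
    seconds_per_token *= 1.0 + dequant_overhead
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Hypothetical 8B-parameter model on a device with 400 GB/s of bandwidth.
    fp16 = decode_tokens_per_second(8, 16, 400)
    int4_ideal = decode_tokens_per_second(8, 4, 400)
    int4_costly = decode_tokens_per_second(8, 4, 400, dequant_overhead=0.25)
    print(f"FP16: {fp16:.1f} tok/s")
    print(f"4-bit, free dequantization: {int4_ideal:.1f} tok/s")
    print(f"4-bit, 25% dequantization cost: {int4_costly:.1f} tok/s")
```

Under these assumptions, shrinking the weights raises the bandwidth ceiling, while any dequantization cost eats into that gain. Once the kernel becomes compute-bound, the estimate no longer applies, which is one way quantization can fail to deliver a speedup.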

Abstract

A systematic understanding of Apple Silicon is lacking in the current landscape of hardware efficiency; research is largely centered on accelerating GPUs for large-scale training or inference on CUDA devices. This paper investigates Apple Silicon's unique memory architecture, which offers unified memory shared by the CPU and GPU, and its implications for on-device LLM inference. We examine myths about whether Apple Silicon is efficient for on-device inference compared to competitors such as NVIDIA GPUs by directly conducting latency and throughput benchmarks. We explain the performance gap between them by profiling low-level hardware metrics (ALU utilization, memory bandwidth, buffer usage, cache residency, etc.) at runtime. We draw several insights regarding performance bottlenecks such as dequantization overhead, compute throughput, and memory bandwidth. We…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy