Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS

Varun Rajesh; Om Jodhpurkar; Pooja Anbuselvan; Mantinder Singh; Ashok Jallepali; Shantanu Godbole; Pradeep Kumar Sharma; Hritvik Shrivastava

arXiv:2511.05502 · cs.AR · November 11, 2025

Open Access

TL;DR

This paper systematically compares five local LLM runtimes on Apple Silicon (MLX, MLC-LLM, llama.cpp, Ollama, and PyTorch MPS), evaluating their performance, inference features, and deployment complexity to guide runtime selection for on-device applications.

Contribution

The paper provides an empirical evaluation of five Apple Silicon LLM runtimes on a single M2 Ultra machine, highlighting their strengths, limitations, and design trade-offs for production use.

Findings

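01. MLX achieves the highest sustained generation throughput.

02. MLC-LLM delivers consistently lower TTFT for moderate prompt sizes and offers stronger out-of-the-box inference features.

03. llama.cpp is highly efficient for lightweight single-stream use.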

Abstract

We present a systematic, empirical evaluation of five local large language model (LLM) runtimes on Apple Silicon: MLX, MLC-LLM, llama.cpp, Ollama, and PyTorch MPS. Experiments were conducted on a Mac Studio equipped with an M2 Ultra processor and 192 GB of unified memory. Using the Qwen-2.5 model family across prompts ranging from a few hundred to 100,000 tokens, we measure time-to-first-token (TTFT), steady-state throughput, latency percentiles, long-context behavior (key-value and prompt caching), quantization support, streaming performance, batching and concurrency behavior, and deployment complexity. Under our settings, MLX achieves the highest sustained generation throughput, while MLC-LLM delivers consistently lower TTFT for moderate prompt sizes and offers stronger out-of-the-box inference features. llama.cpp is highly efficient for lightweight single-stream use, Ollama…
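
As a rough illustration of the metrics named in the abstract, the sketch below derives TTFT, steady-state throughput, and latency percentiles from token arrival timestamps. The function names, the timestamp format, and the choice to exclude the first token from the throughput figure are assumptions made for this example; they are not taken from the paper's measurement harness.

```python
from statistics import quantiles


def summarize_stream(request_start, token_times):
    """Summarize one streamed generation (hypothetical helper).

    request_start: time the request was sent, in seconds.
    token_times: arrival time of each generated token, in seconds, ascending.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_start
    # Assumed definition of steady-state throughput: tokens produced after
    # the first one, divided by the time spent producing them.
    decode_time = token_times[-1] - token_times[0]
    decode_tokens = len(token_times) - 1
    throughput = decode_tokens / decode_time if decode_time > 0 else float("nan")
    return {"ttft_s": ttft, "tokens_per_s": throughput}


def latency_percentiles(latencies, points=(50, 95, 99)):
    """Return selected percentiles of a list of latencies (needs >= 2 values)."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] for p in points}
```

In practice the timestamps would come from a monotonic clock such as `time.perf_counter()` read as each token arrives from the runtime's streaming interface, and the percentiles would be computed over many repeated requests.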

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Software System Performance and Reliability · Advanced Data Storage Technologies