Native LLM and MLLM Inference at Scale on Apple Silicon

Wayner Barrios

arXiv:2601.19139·cs.LG·January 30, 2026

Open Access

TL;DR

This paper introduces vllm-mlx, a framework built natively on MLX for large language and multimodal model inference on Apple Silicon. It reports 21% to 87% higher text throughput than llama-cpp, continuous batching, and content-based prefix caching that avoids redundant vision encoding.

Contribution

The paper presents vllm-mlx, a native Apple Silicon inference framework with content-based prefix caching for multimodal models, surpassing existing tools in throughput and latency.

Findings

01 21-87% higher throughput than llama-cpp for text models

02 Up to 4.3x aggregate throughput with continuous batching

03 28x speedup on repeated image queries and 24.7x cache speedup for video

Abstract

The growing adoption of Apple Silicon for machine learning development has created demand for efficient inference solutions that leverage its unique unified memory architecture. However, existing tools either lack native optimization (PyTorch MPS) or focus solely on text models, leaving multimodal workloads underserved. We present vllm-mlx, a framework for efficient LLM and MLLM inference on Apple Silicon built natively on MLX. For text models, we achieve 21% to 87% higher throughput than llama-cpp across models ranging from Qwen3-0.6B to Nemotron-30B, while providing continuous batching that scales to 4.3x aggregate throughput at 16 concurrent requests. For multimodal models, we introduce content-based prefix caching that eliminates redundant vision encoding by identifying identical images through content hashing, regardless of input format. Our evaluation on Apple M4 Max…
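The content-hashing idea in the abstract can be illustrated with a minimal sketch. This is not the vllm-mlx implementation: the names `to_bytes` and `VisionCache`, the choice of SHA-256, and the decision to hash the decoded file bytes (rather than decoded pixels) are all assumptions made for illustration. The point is only that the cache key is derived from the image content, so the same image supplied as raw bytes, a base64 string, or a data URL maps to one entry and the encoder runs once.

```python
import base64
import hashlib
from typing import Callable, Dict, Union

# Hypothetical input type: raw bytes, a base64 string, or a data URL.
ImageInput = Union[bytes, str]


def to_bytes(image: ImageInput) -> bytes:
    """Normalise the supported input formats to raw image bytes."""
    if isinstance(image, bytes):
        return image
    if image.startswith("data:"):
        # Data URL: keep only the base64 payload after the first comma.
        image = image.split(",", 1)[1]
    return base64.b64decode(image)


class VisionCache:
    """Caches encoder output keyed by a hash of the image content.

    `encode` stands in for an expensive vision encoder (an assumption;
    any callable taking bytes works here).
    """

    def __init__(self, encode: Callable[[bytes], list]):
        self._encode = encode
        self._store: Dict[str, list] = {}
        self.hits = 0
        self.misses = 0

    def get(self, image: ImageInput) -> list:
        data = to_bytes(image)
        # The key depends only on content, not on how the image was passed in.
        key = hashlib.sha256(data).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode(data)
        return self._store[key]
```

Calling `get` with the same image as bytes, as base64 text, and as a `data:image/png;base64,...` URL would invoke the encoder once and count two hits. A production system would also need an eviction policy and might hash decoded pixels so that re-encoded copies of the same image also match; both are outside this sketch.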

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Natural Language Processing Techniques