Native LLM and MLLM Inference at Scale on Apple Silicon
Wayner Barrios

TL;DR
This paper introduces vllm-mlx, a framework built natively on MLX for large language model (LLM) and multimodal LLM (MLLM) inference on Apple Silicon. It reports 21-87% higher text throughput than llama-cpp and adds content-based prefix caching that avoids re-encoding repeated images and video.
Contribution
The paper presents vllm-mlx, a native Apple Silicon inference framework with content-based prefix caching for multimodal models, surpassing existing tools in throughput and latency.
Findings
21-87% higher throughput than llama-cpp for text models
Up to 4.3x aggregate throughput with continuous batching
28x speedup on repeated image queries and 24.7x cache speedup for video
Abstract
The growing adoption of Apple Silicon for machine learning development has created demand for efficient inference solutions that leverage its unique unified memory architecture. However, existing tools either lack native optimization (PyTorch MPS) or focus solely on text models, leaving multimodal workloads underserved. We present vllm-mlx, a framework for efficient LLM and MLLM inference on Apple Silicon built natively on MLX. For text models, we achieve 21% to 87% higher throughput than llama-cpp across models ranging from Qwen3-0.6B to Nemotron-30B, while providing continuous batching that scales to 4.3x aggregate throughput at 16 concurrent requests. For multimodal models, we introduce content-based prefix caching that eliminates redundant vision encoding by identifying identical images through content hashing, regardless of input format. Our evaluation on Apple M4 Max…
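The content-hashing idea in the abstract can be illustrated with a minimal sketch. This is not the vllm-mlx implementation: the names `to_bytes`, `content_key`, and `VisionCache` are hypothetical, the encoder is any caller-supplied function, and the sketch assumes images arrive as raw bytes, a file path, a base64 string, or a data URI. It hashes the decoded file bytes, so the same picture re-encoded in a different file format would produce a different key; a real system might hash decoded pixels instead.

```python
import base64
import hashlib
from pathlib import Path
from typing import Any, Callable, Dict, Union

# Hypothetical input type: raw bytes, a file path, or a base64 / data-URI string.
ImageInput = Union[bytes, str, Path]


def to_bytes(image: ImageInput) -> bytes:
    """Normalize the supported input forms to the underlying image bytes."""
    if isinstance(image, bytes):
        return image
    if isinstance(image, Path):
        return image.read_bytes()
    if image.startswith("data:"):
        # Data URI of the form "data:image/png;base64,<payload>".
        _, payload = image.split(",", 1)
        return base64.b64decode(payload)
    # Assumption: any other string is a bare base64 payload.
    return base64.b64decode(image)


def content_key(image: ImageInput) -> str:
    """Cache key derived from the image content, not from how it was passed in."""
    return hashlib.sha256(to_bytes(image)).hexdigest()


class VisionCache:
    """Memoizes encoder output per content hash (illustrative only)."""

    def __init__(self, encoder: Callable[[bytes], Any]) -> None:
        self._encoder = encoder
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def encode(self, image: ImageInput) -> Any:
        key = content_key(image)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: run the (expensive) encoder once and remember the result.
        self.misses += 1
        features = self._encoder(to_bytes(image))
        self._store[key] = features
        return features
```

With this sketch, passing the same image first as bytes and then as a data URI runs the encoder once; the second call is served from the cache, which is the behavior the abstract describes as eliminating redundant vision encoding.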
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Natural Language Processing Techniques
