Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
Yoichi Ochiai

TL;DR
This paper systematically optimizes real-time diffusion model inference on Apple M3 Ultra, revealing unique challenges and solutions distinct from CUDA-based platforms, achieving 22.7 FPS for image transformation.
Contribution
It demonstrates that optimization strategies effective on CUDA are not directly applicable to Apple Silicon, providing practical guidelines for diffusion inference on this platform.
Findings
Quantization does not speed up inference on Apple Silicon.
Parallel inference is ineffective on Apple Silicon's architecture.
Combining CoreML conversion with a 3-thread pipeline achieves real-time performance.
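As a rough illustration of the 3-thread pipeline idea, the sketch below wires three threads together with single-slot queues so that a slow stage drops stale frames instead of building up latency. The summary does not say how the paper divides work between its threads, so the capture / inference / display split, the names `run_pipeline` and `put_latest`, and the drop-stale-frame policy are all assumptions made for illustration. The `transform` callable is a stand-in for the model call; no real camera or CoreML API is used.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the stream


def put_latest(q, item):
    """Put item into a size-1 queue, discarding a stale item if one is waiting."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stale item
            except queue.Empty:
                pass  # the consumer took it first; retry the put


def run_pipeline(frames, transform):
    """Run a hypothetical capture -> inference -> display pipeline on three threads.

    frames:    iterable standing in for a camera source (assumption)
    transform: callable standing in for the model's img2img call (assumption)
    Returns the list of frames that reached the display stage.
    """
    raw_q = queue.Queue(maxsize=1)  # capture -> inference
    out_q = queue.Queue(maxsize=1)  # inference -> display
    shown = []

    def capture():
        for frame in frames:
            put_latest(raw_q, frame)
        raw_q.put(_STOP)  # blocking put, so the sentinel is never dropped

    def infer():
        while True:
            frame = raw_q.get()
            if frame is _STOP:
                out_q.put(_STOP)
                return
            put_latest(out_q, transform(frame))

    def display():
        while True:
            result = out_q.get()
            if result is _STOP:
                return
            shown.append(result)  # a real app would draw to the screen here

    threads = [threading.Thread(target=fn) for fn in (capture, infer, display)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shown
```

With this design the slowest stage sets the output rate rather than the sum of all stages. For scale, the reported 22.7 FPS corresponds to a budget of roughly 1000 / 22.7 ≈ 44 ms per displayed frame.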
Abstract
While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted comprehensive optimization experiments across 10 phases targeting the Apple M3 Ultra (60-core GPU, 512 GB unified memory) with the goal of achieving real-time camera img2img transformation. We explored a wide range of techniques including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, quantitatively evaluating the effectiveness of each approach. Ultimately, by combining CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline, we achieved real-time…
