Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Juntao Zhao, Jiuru Li, Chuan Wu

TL;DR
Sandwich is a full-stack CPU LLM serving system that jointly searches serving configurations and hot-switches between prefill and decode plans, achieving an average 2.01x end-to-end speedup and up to 3.40x latency reduction across five x86/ARM CPU platforms.
Contribution
It introduces a full-stack approach with phase-wise plan switching, hardware-aware core allocation, and dynamic tensor program generation for efficient CPU LLM serving.
Findings
Average 2.01x end-to-end speedup across five CPU platforms
Up to 3.40x latency reduction over state-of-the-art systems
Kernels match static compiler performance with much lower tuning cost
Abstract
CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints. Existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation. Across five x86/ARM CPU platforms, Sandwich achieves an average 2.01x end-to-end speedup and up to 3.40x latency…
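To make the idea of a tree-based hardware abstraction concrete, here is a minimal Python sketch of substructure-aware partial core allocation. It is an illustration under stated assumptions, not Sandwich's TopoTree implementation: the names `TopoNode`, `build_tree`, and `allocate` are hypothetical, the machine layout is synthetic, and the "take the first k cores of every LLC slice" policy is only one plausible way to express a partial allocation.

```python
from dataclasses import dataclass, field


@dataclass
class TopoNode:
    """One node of a hypothetical hardware tree (machine, NUMA node, LLC slice, or core)."""
    kind: str
    ident: int
    children: list["TopoNode"] = field(default_factory=list)

    def cores(self) -> list[int]:
        """Return the core ids below this node, in left-to-right order."""
        if self.kind == "core":
            return [self.ident]
        return [c for child in self.children for c in child.cores()]


def build_tree(numa_nodes: int, slices_per_numa: int, cores_per_slice: int) -> TopoNode:
    """Build a synthetic machine -> NUMA -> LLC slice -> core tree (assumed layout)."""
    root = TopoNode("machine", 0)
    core_id = 0
    for n in range(numa_nodes):
        numa = TopoNode("numa", n)
        for s in range(slices_per_numa):
            llc = TopoNode("llc_slice", s)
            for _ in range(cores_per_slice):
                llc.children.append(TopoNode("core", core_id))
                core_id += 1
            numa.children.append(llc)
        root.children.append(numa)
    return root


def allocate(node: TopoNode, per_group: int, level: str = "llc_slice") -> list[int]:
    """Pick the first `per_group` cores under every node of kind `level`."""
    if node.kind == level:
        return node.cores()[:per_group]
    return [c for child in node.children for c in allocate(child, per_group, level)]


if __name__ == "__main__":
    machine = build_tree(numa_nodes=2, slices_per_numa=2, cores_per_slice=4)
    # Use half of the cores in each LLC slice, e.g. for a bandwidth-bound phase.
    print(allocate(machine, per_group=2))
```

Walking the tree instead of a flat core list means an allocation can be expressed per substructure (here, per LLC slice) rather than per NUMA node, which is the granularity the abstract says existing solutions ignore.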
