CCD-Level and Load-Aware Thread Orchestration for In-Memory Vector ANNS on Multi-Core CPUs
Yuchen Huang, Baiteng Ma, Yiping Sun, Yang Shi, Xiao Chen, Xiaocheng Zhong, Zhiyong Wang, Yao Hu, Chuliang Weng

TL;DR
This paper introduces a CCD-level, load-aware thread orchestration framework for in-memory vector ANNS on multi-core CPUs, significantly improving throughput and latency by optimizing task dispatching and cache utilization.
Contribution
It proposes a novel CCD-aware thread orchestration framework that improves the performance of vector ANNS workloads on multi-chiplet CPUs through workload- and hardware-aware task management.
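This summary does not describe the paper's actual dispatching policy, so the following is only a minimal sketch of what load-aware, CCD-level task placement could look like. Everything in it is an assumption made for illustration: the names `CcdQueue`, `pick_ccd`, and `dispatch`, the use of a hypothetical `partition_id` as the locality key, the modulo mapping from partition to "home" CCD, and the `max_imbalance` threshold. The idea shown is to keep queries that touch the same data on the same CCD (so they share its L3 cache) unless that CCD's queue has grown too far beyond the least-loaded one.

```python
from dataclasses import dataclass


@dataclass
class CcdQueue:
    """Hypothetical per-CCD task queue; only the queue depth is modeled."""
    ccd_id: int
    pending: int = 0


def pick_ccd(partition_id, queues, max_imbalance=4):
    """Choose a CCD for a query that touches `partition_id`.

    Assumption: each partition has a "home" CCD (partition_id modulo the
    number of CCDs) whose L3 cache is likely to hold its vectors already.
    The home CCD is used unless it has more than `max_imbalance` extra
    pending tasks compared with the least-loaded CCD.
    """
    home = queues[partition_id % len(queues)]
    least_loaded = min(queues, key=lambda q: q.pending)
    if home.pending - least_loaded.pending > max_imbalance:
        return least_loaded  # load-aware fallback: give up cache affinity
    return home              # locality-aware default


def dispatch(partition_ids, num_ccds=4, max_imbalance=4):
    """Place a stream of queries and return the chosen CCD id for each."""
    queues = [CcdQueue(i) for i in range(num_ccds)]
    placement = []
    for pid in partition_ids:
        queue = pick_ccd(pid, queues, max_imbalance)
        queue.pending += 1  # no task completion is modeled in this sketch
        placement.append(queue.ccd_id)
    return placement


if __name__ == "__main__":
    # A skewed stream: every query hits partition 0.
    print(dispatch([0, 0, 0, 0, 0, 0], num_ccds=2, max_imbalance=2))
```

In a real deployment the worker threads behind each queue would also need to be pinned to the cores of one CCD (on Linux, for example, with `os.sched_setaffinity`); the sketch leaves that out and models only the placement decision.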
Findings
Achieves up to 3.7x higher throughput in production workloads.
Reduces P50 and P999 latency by 30-90%.
Decreases cache-miss ratio by 6-30% and CPU stalls by 20-80%.
Abstract
Vector approximate nearest neighbor search (ANNS) underpins search engines, recommendation systems, and advertising services. Recent advances in ANNS indexes make CPUs a cost-effective choice for serving million-scale, in-memory vector search, yet in production deployments per-core throughput remains constrained by the memory access latency of vector reads and the compute intensity of distance evaluations. As our business has grown and hardware has advanced, modern CCD-based multi-core CPUs have been widely deployed in our services to achieve high throughput. However, we find that simply increasing core counts does not yield optimal performance scaling. To use the additional cores of the CCD-based architecture more efficiently, we analyze the distributions of real-world requests in our production environments. We observe high access locality in vector search in our online services and…
