Vector-Centric Machine Learning Systems: A Cross-Stack Approach
Wenqi Jiang

TL;DR
This thesis presents a cross-stack approach to improving the efficiency of vector-centric ML systems, spanning algorithm, system, and hardware optimizations for RAG, vector search, and recommender systems.
Contribution
It introduces novel algorithms, system designs, and hardware co-design strategies specifically targeting the efficiency of vector-based ML applications.
Findings
Enhanced RAG serving efficiency with PipeRAG, RAGO, and Chameleon.
Optimized vector search algorithms FANNS and Falcon for hardware efficiency.
Improved recommender system performance with MicroRec and FleetRec.
Abstract
Today, two major trends are shaping the evolution of ML systems. First, modern AI systems are becoming increasingly complex, often integrating components beyond the model itself. A notable example is Retrieval-Augmented Generation (RAG), which incorporates not only multiple models but also vector databases, leading to heterogeneity in both system components and underlying hardware. Second, with the end of Moore's Law, achieving high system efficiency is no longer feasible without accounting for the rapid evolution of the hardware landscape. Building on these observations, this thesis adopts a cross-stack approach to improving ML system efficiency, presenting solutions that span algorithms, systems, and hardware. First, it introduces several pioneering works on RAG serving efficiency across the computing stack. PipeRAG focuses on algorithm-level improvements, RAGO introduces…
