SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs

Bi Xue; Hong Wu; Lei Chen; Chao Yang; Yiming Ma; Fei Ding; Zhen Wang; Liang Wang; Xiaoheng Mao; Ke Huang; Xialu Li; Peng Xia; Rui Jian; Yanli Zhao; Yanzun Huang; Yijie Deng; Harry Tran; Ryan Chang; Min Yu; Eric Dong; Jiazhou Wang; Qianqian Zhang; Keke Zhai; Hongzhang Yin; Pawel Garbacki; Jiaqi Zhai; Zheng Fang; Yiyi Pan; Min Ni; Kevin Greer; Rui Zhang; Yang Liu

arXiv:2511.14881·cs.IR·May 11, 2026

TL;DR

SilverTorch is a unified GPU-based recommendation serving system that replaces traditional CPU indexing services with model layers, significantly improving throughput and cost-efficiency while supporting complex models at scale.

Contribution

The paper introduces a unified model-based serving system with GPU-optimized indexing and retrieval, enabling scalable, accurate, and cost-efficient recommendation serving.

Findings

01 Achieves up to 23.7× higher throughput than state-of-the-art methods.

02 Is 13.35× more cost-efficient than CPU-based solutions.

03 Supports complex models and multi-task retrieval at industry scale.

Abstract

Serving deep-learning-based recommendation models (DLRMs) at scale is challenging. Existing approaches rely on dedicated ANN indexing and filtering services running on CPUs, which incur non-negligible costs and miss co-design opportunities. These inefficiencies make it difficult for such systems to support complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we present SilverTorch, a model-based serving system that replaces the standalone indexing and filtering services with model layers, bringing all components into one unified model. We propose a model-based GPU Bloom index for feature filtering and a fused Int8 ANN kernel for nearest neighbor search. Through co-design of the ANN search and feature filtering, we reduce GPU memory usage and eliminate unnecessary computation. Benefiting from this design, we scale up retrieval by introducing an…
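
The idea of pairing a Bloom-style feature filter with Int8 similarity scoring can be illustrated with a small CPU-only sketch. This is not the paper's GPU kernel or index layout: the names `bloom_bits`, `signature`, `quantize_int8`, `make_item`, and `filtered_top_k`, the 64-bit signature width, the two hash functions, and the brute-force scan are all assumptions made for illustration. The sketch only shows why checking the filter before scoring lets candidates that fail it skip the dot product entirely.

```python
import hashlib

NUM_BITS = 64    # signature width; an illustrative choice, not from the paper
NUM_HASHES = 2   # hash functions per feature token; also illustrative


def bloom_bits(token):
    """Return an int with up to NUM_HASHES bit positions set for one token."""
    mask = 0
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % NUM_BITS)
    return mask


def signature(features):
    """OR together the Bloom bits of every feature into one signature."""
    sig = 0
    for feature in features:
        sig |= bloom_bits(feature)
    return sig


def quantize_int8(vec):
    """Symmetric quantization to integers in [-127, 127] plus a float scale."""
    scale = (max(abs(x) for x in vec) / 127) or 1.0
    return [round(x / scale) for x in vec], scale


def make_item(item_id, features, vec):
    """Hypothetical index entry: (id, Bloom signature, int8 vector, scale)."""
    q_vec, scale = quantize_int8(vec)
    return (item_id, signature(features), q_vec, scale)


def filtered_top_k(query, required, items, k):
    """Score only items whose signature may contain all required features."""
    q_vec, q_scale = quantize_int8(query)
    need = signature(required)
    scored = []
    for item_id, sig, vec, scale in items:
        if (sig & need) != need:
            continue  # definitely lacks a required feature: skip the dot product
        dot = sum(a * b for a, b in zip(q_vec, vec))  # integer dot product
        scored.append((dot * q_scale * scale, item_id))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy index with made-up items and feature tokens.
items = [
    make_item("a", ["lang:en"], [1.0, 0.0]),
    make_item("b", ["lang:en", "cat:sports"], [0.0, 1.0]),
    make_item("c", ["lang:fr"], [0.9, 0.1]),
]
query = [1.0, 0.0]
print(filtered_top_k(query, ["lang:en"], items, 2))
```

A Bloom signature can produce false positives but never false negatives, so an item that truly carries every required feature always passes the check, while an occasional item that lacks one may pass as well. A production system would have to tolerate or post-filter such cases; how SilverTorch handles them is not stated in the abstract.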
