SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling

Zikun Liu; Liang Luo; Qianru Li; Zhengyu Zhang; Wei Ling; Jingyi Shen; Zeliang Chen; Yaning Huang; Jingxian Huang; Abdallah Aboelela; Chonglin Sun; Feifan Gu; Fenggang Wu; Hang Qu; Huayu Li; Jill Pan; Kaidi Pei; Laming Chen; Longhao Jin; Qin Huang; Tongyi Tang; Varna Puvvada; Wenlin Chen; Xiaohan Wei; Xu Cao; Yantao Yao; Yuan Jin; Yunchen Pu; Yuxin Chen; Zijian Shen; Zhengkai Zhang; Dong Liang; Ellie Wen

arXiv:2604.12110·cs.LG·April 15, 2026

TL;DR

SOLARIS is a framework that precomputes embeddings for user-item pairs likely to appear in future requests, enabling faster inference in recommendation systems and improving real-time serving without sacrificing model quality.

Contribution

The paper introduces a speculative offloading method that precomputes user-item interaction embeddings, decoupling expensive foundation model inference from latency-critical serving, and demonstrates scalability in a large-scale deployment.

Findings

01 Achieves 0.67% revenue increase in Meta's advertising system

02 Enables real-time inference for complex foundation models

03 Decouples inference from serving latency constraints

Abstract

Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation, which compromises serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously…
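
The precompute-then-lookup pattern described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `SpeculativeEmbeddingCache`, the functions `toy_foundation_model` and `serve`, and the fallback-on-miss behavior are all hypothetical names and assumptions introduced here for illustration. Precomputation is also run synchronously for simplicity, whereas the abstract describes it as asynchronous.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]      # (user_id, item_id); hypothetical key format
Embedding = List[float]


class SpeculativeEmbeddingCache:
    """Toy cache of precomputed user-item embeddings (hypothetical design)."""

    def __init__(self, expensive_model: Callable[[Pair], Embedding]) -> None:
        self._model = expensive_model
        self._cache: Dict[Pair, Embedding] = {}

    def precompute(self, predicted_pairs: List[Pair]) -> None:
        # Off the serving path: run the costly model ahead of time for
        # pairs predicted to appear in future requests.
        for pair in predicted_pairs:
            if pair not in self._cache:
                self._cache[pair] = self._model(pair)

    def lookup(self, pair: Pair) -> Optional[Embedding]:
        # On the serving path: a dictionary read only, never a model call.
        return self._cache.get(pair)


def toy_foundation_model(pair: Pair) -> Embedding:
    # Stand-in for an expensive foundation model; returns toy features.
    user_id, item_id = pair
    return [float(len(user_id)), float(len(item_id))]


def serve(
    cache: SpeculativeEmbeddingCache, pair: Pair, fallback: Embedding
) -> Tuple[Embedding, bool]:
    # Assumption: on a miss, serving uses a default embedding rather than
    # blocking on the model. The abstract does not specify miss handling.
    hit = cache.lookup(pair)
    if hit is not None:
        return hit, True
    return fallback, False


cache = SpeculativeEmbeddingCache(toy_foundation_model)
cache.precompute([("u1", "item42")])                  # speculated pair
print(serve(cache, ("u1", "item42"), [0.0, 0.0]))     # speculation hit
print(serve(cache, ("u2", "item7"), [0.0, 0.0]))      # speculation miss
```

In this sketch, the quality of the pair predictor determines the hit rate: every correctly speculated pair is served from the cache without touching the expensive model, which is the decoupling the abstract describes.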
