Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders

Zhimin Chen; Chenyu Zhao; Ka Chun Mo; Yunjiang Jiang; Jane H. Lee; Khushhall Chandra Mahajan; Ning Jiang; Kai Ren; Jinhui Li; Wen-Yun Yang

arXiv:2510.22049·cs.IR·March 27, 2026

3 Reviews

TL;DR

This paper introduces VISTA, a two-stage model that efficiently handles ultra-long user histories in recommendation systems, enabling scalable, cost-effective training and inference at industry scale.

Contribution

The paper proposes a novel two-stage framework, VISTA, that decomposes target attention into summarization and candidate attention, allowing scalable handling of up to one million user history items.
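The decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes plain softmax attention in place of the paper's own attention mechanism, random vectors standing in for learned virtual seed embeddings, and a dot-product score between each candidate and its attended summary. The function names `summarize_history` and `score_candidates` and all sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values


def summarize_history(history, seeds):
    """Stage 1 (assumed to run ahead of serving): m seed vectors attend over
    the n history items, compressing an (n, d) history into an (m, d) summary."""
    return attend(seeds, history, history)


def score_candidates(candidates, summary):
    """Stage 2 (serving time): each candidate attends over the m summary
    vectors only, so the cost depends on m rather than on the history length n.
    The dot-product score is an illustrative assumption."""
    context = attend(candidates, summary, summary)
    return (context * candidates).sum(axis=-1)


rng = np.random.default_rng(0)
d, n_history, n_seeds, n_candidates = 16, 10_000, 64, 5  # hypothetical sizes

history = rng.standard_normal((n_history, d))        # user history item embeddings
seeds = rng.standard_normal((n_seeds, d))            # stand-in for learned seeds
candidates = rng.standard_normal((n_candidates, d))  # candidate item embeddings

summary = summarize_history(history, seeds)      # shape (64, 16)
scores = score_candidates(candidates, summary)   # shape (5,)
print(summary.shape, scores.shape)
```

In this sketch the only step whose cost grows with the history length is `summarize_history`. The candidate stage always sees a fixed-size `(n_seeds, d)` summary, which is the property that allows serving cost to stay flat as histories grow.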

Findings

01. VISTA achieves significant offline metric improvements.

02. VISTA maintains fixed training and inference costs at large scale.

03. VISTA has been successfully deployed on a billion-user industry platform.

Abstract

Modern large-scale recommendation systems rely heavily on user interaction history sequences to enhance the model performance. The advent of large language models and sequential modeling techniques, particularly transformer-like architectures, has led to significant advancements recently (e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories (10k to 100k items) generally improves model performance, it also creates significant challenges on latency, queries per second (QPS) and GPU cost in industry-scale recommendation systems. Existing models do not adequately address these industrial scalability issues. In this paper, we propose a novel two-stage modeling framework, namely VIrtual Sequential Target Attention (VISTA), which decomposes traditional target attention from a candidate item to user history items into two distinct stages: (1) user history summarization…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 4

Strengths

1. This paper is well motivated. Scalability and efficiency are important for industrial recommendation systems. The core two-stage architecture (UIH summarization and target attention) effectively decouples the quadratic computation of attention from real-time inference.
2. The proposed quasi-linear attention (QLA) with linear complexity is tailored for recommendation.
3. The results are backed by comprehensive offline and online A/B tests on a massive industrial-scale dataset, showing sign

Weaknesses

On public datasets like Amazon and KuaiRand, the performance gains are marginal compared to baselines. This suggests the primary advantage of VISTA is strictly in the extreme-scale, ultra-long sequence regime of proprietary industrial data, limiting generalizability.

Reviewer 02·Rating 4·Confidence 3

Strengths

- Interesting design. VISTA decouples UIH processing into offline summarization and online attention. This design enables handling UIH of up to 1M items while keeping inference costs fixed, which is critical for industrial systems serving billions of users with lifelong interaction histories.
- QLA resolves linear attention’s limited expressiveness by integrating SiLU non-linearity and self-target attention.
- Practical Industrial Deployment Design: VISTA includes a distributed embedding delivery system t

Weaknesses

- Reliance on Seed Embedding Quality: The summarization stage’s performance hinges on virtual seed embeddings. Experiments show that increasing seeds from 64 to 128 improves NE by 0.04–0.12% but raises storage costs exponentially.
- Limited Analysis of Reconstruction Loss: The generative reconstruction loss is claimed to enhance information retention, but its contribution is weakly validated. The ablation shows the loss has marginal value.
- Brittleness in Short UIH Scenarios: On public datasets with sho

Reviewer 03·Rating 6·Confidence 3

Strengths

The paper tackles a practically significant challenge in large-scale recommender systems: how to efficiently model ultra-long user behavior sequences without overwhelming computational resources. The idea of decoupling offline summarization and online attention is pragmatic and fits well within production serving architectures. The framework appears to strike a reasonable balance between modeling capacity and latency, and its design aligns with industrial constraints. The experiments cover multi

Weaknesses

W1: The paper presents various performance comparisons but provides no statistical significance tests for these results, so it is unclear whether the differences are due to random seeds. Supplementary tests and related metrics are needed to validate the gains (e.g., the marginal +0.002 AUC gain in Table 1).
W2: While the paper reports metrics like CTR improvements from online A/B tests, the authors do not describe the computational efficacy in this scenario, which is essential for assessing the re
