TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders

Yuchen Jiang; Jie Zhu; Xintian Han; Hui Lu; Kunmin Bai; Mingyu Yang; Shikang Wu; Ruihao Zhang; Wenlin Zhao; Shipeng Bai; Sijin Zhou; Huizhi Yang; Tianyi Liu; Wenda Liu; Ziyan Gong; Haoran Ding; Zheng Chai; Deping Xie; Zhe Chen; Yuchao Zheng; Peng Xu

arXiv:2602.06563·cs.IR·February 11, 2026

Open Access

TL;DR

TokenMixer-Large is a new large-scale recommendation model that overcomes previous limitations in scalability and efficiency, achieving significant online and offline performance improvements in industrial settings.

Contribution

It introduces a novel architecture with mixing-and-reverting operations, residuals, auxiliary loss, and sparse MoE for scalable, efficient recommendation modeling.

Findings

01 Scaled to 7B and 15B parameters with successful deployment

02 Achieved +1.66% in orders and +2.98% in GMV in e-commerce

03 Improved advertising and live streaming revenue metrics

Abstract

While scaling laws for recommendation models have gained significant traction, existing architectures such as Wukong, HiFormer and DHEN often struggle with sub-optimal designs and hardware under-utilization, limiting their practical scalability. Our previous TokenMixer architecture (introduced in the RankMixer paper) addressed effectiveness and efficiency by replacing self-attention with a lightweight token-mixing operator; however, it faced critical bottlenecks in deeper configurations, including sub-optimal residual paths, vanishing gradients, incomplete MoE sparsification and constrained scalability. In this paper, we propose TokenMixer-Large, a systematically evolved architecture designed for extreme-scale recommendation. By introducing a mixing-and-reverting operation, inter-layer residuals and an auxiliary loss, we ensure stable gradient propagation even as model depth increases.…
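
The abstract names a mixing-and-reverting operation but does not define it here. The sketch below shows one plausible reading, and it rests on assumptions. It assumes the mixing step is a parameter-free regrouping: each token is split into heads, and the h-th slice of every token is concatenated into a new mixed token. It assumes the reverting step is the exact inverse regrouping. The function names `mix`, `revert` and `residual_block`, and the per-token transform `fn`, are hypothetical and not taken from the paper. The point illustrated is only that reverting restores the original token layout, so a residual connection can add inputs and outputs position by position.

```python
def mix(tokens, num_heads):
    """Regroup T tokens of width D into num_heads tokens of width T * (D // num_heads).

    Mixed token h is the concatenation of slice h from every input token.
    Assumed form of token mixing; not confirmed by the abstract.
    """
    width = len(tokens[0])
    assert width % num_heads == 0, "token width must divide evenly into heads"
    d = width // num_heads
    mixed = []
    for h in range(num_heads):
        new_token = []
        for tok in tokens:
            new_token.extend(tok[h * d:(h + 1) * d])
        mixed.append(new_token)
    return mixed


def revert(mixed, num_tokens):
    """Inverse of mix: rebuild the original T tokens from the mixed tokens."""
    assert len(mixed[0]) % num_tokens == 0, "mixed width must divide evenly"
    d = len(mixed[0]) // num_tokens
    tokens = []
    for t in range(num_tokens):
        tok = []
        for mixed_tok in mixed:
            tok.extend(mixed_tok[t * d:(t + 1) * d])
        tokens.append(tok)
    return tokens


def residual_block(tokens, num_heads, fn):
    """Mix, apply a hypothetical per-token transform fn, revert, then add the input."""
    transformed = [fn(tok) for tok in mix(tokens, num_heads)]
    reverted = revert(transformed, len(tokens))
    return [[a + b for a, b in zip(x, y)] for x, y in zip(tokens, reverted)]


# Toy input: 3 tokens of width 4, split into 2 heads.
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
mixed_x = mix(x, num_heads=2)          # 2 tokens of width 6
restored_x = revert(mixed_x, len(x))   # same layout as x
doubled = residual_block(x, 2, lambda tok: [2 * v for v in tok])
```

In this toy setup the number of mixed tokens (2) differs from the number of input tokens (3), so without the revert step the output could not be added back to the input element-wise. Whether the paper's operator has this exact form is not stated in the abstract.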

Taxonomy

Topics: Recommender Systems and Techniques · Advanced Graph Neural Networks · Text and Document Classification Technologies