Cubit: Token Mixer with Kernel Ridge Regression

Chuanyang Zheng; Jiankai Sun; Yihang Gao; Yuehao Wang; Liangchen Tan; Mac Schwager; Anderson Schneider; Yuriy Nevmyvaka; Xiaodong Liu

arXiv:2605.06501·cs.LG·May 20, 2026

TL;DR

Cubit introduces a novel token-mixing architecture based on Kernel Ridge Regression, offering a mathematically grounded alternative to Transformer attention with improved long-sequence modeling capabilities.

Contribution

The paper proposes Cubit, a new architecture that replaces attention with a token mixer based on Kernel Ridge Regression, and introduces Limited-Range Rescale to stabilize training.

Findings

01 Cubit shows stronger long-sequence modeling performance than the Transformer.

02 The performance gain increases with training sequence length.

03 Cubit provides a more solid mathematical foundation than traditional Transformer attention.

Abstract

Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel…
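
The contrast the abstract draws between the two estimators can be sketched in a few lines of NumPy. This is a textbook illustration, not Cubit's actual formulation: the abstract is truncated above, so the exponential kernel, the ridge parameter `lam`, the function names, and the absence of causal masking and of the Limited-Range Rescale are all assumptions made here for illustration.

```python
import numpy as np

def nadaraya_watson_attention(Q, K, V):
    """Softmax attention read as Nadaraya-Watson regression:
    each output is a kernel-weighted average of the values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V

def krr_mixer(Q, K, V, lam=0.1):
    """Hypothetical KRR-style mixer using the classical closed form
    f(q) = k(q, K) (k(K, K) + lam * I)^{-1} V.
    The exponential kernel is an assumption chosen to mirror softmax."""
    n, d = K.shape
    gram_kk = np.exp(K @ K.T / np.sqrt(d))  # (n, n) key-key Gram matrix
    gram_qk = np.exp(Q @ K.T / np.sqrt(d))  # (m, n) query-key similarities
    # Solve the regularized linear system instead of forming an explicit inverse.
    alpha = np.linalg.solve(gram_kk + lam * np.eye(n), V)
    return gram_qk @ alpha

rng = np.random.default_rng(0)
n, d, d_v = 6, 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

out_nw = nadaraya_watson_attention(Q, K, V)
out_krr = krr_mixer(Q, K, V, lam=0.1)
print(out_nw.shape, out_krr.shape)
```

The design difference is visible in the outputs: the Nadaraya-Watson form always returns a convex combination of the values, while the KRR form first solves for coefficients that account for correlations among the keys, so with a very small `lam` it approximately reproduces each value when queried at its own key.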
