TL;DR
LIME introduces a novel recommendation model that significantly reduces inference complexity by decoupling user and candidate interactions and employing linear attention, enabling faster and scalable recommendations without sacrificing accuracy.
Contribution
LIME's innovative architecture achieves near-parity with state-of-the-art transformers while drastically reducing computational costs, especially for large candidate sets and long user histories.
Findings
LIME achieves 10x faster inference on large datasets.
LIME maintains competitive recommendation accuracy.
LIME improves user engagement in industrial deployment.
Abstract
Scaling large recommendation systems requires advancing three major frontiers: processing longer user histories, expanding candidate sets, and increasing model capacity. While transformers are promising, their computational cost scales quadratically with the user sequence length and linearly with the number of candidates. This trade-off makes it prohibitively expensive to expand candidate sets or increase sequence length at inference, despite the significant performance improvements. We introduce LIME, a novel architecture that resolves this trade-off. Through two key innovations, LIME fundamentally reduces computational complexity. First, low-rank "link embeddings" enable pre-computation of attention weights by decoupling user and candidate interactions, making the inference cost nearly independent of candidate set size. Second, a linear attention mechanism, LIME-XOR,…
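The decoupling described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes softmax attention, a shared embedding dimension, and the factorization suggested by one reviewer's reading (candidate-to-link weights from T and L, multiplied by personalized links L^P). All function names, shapes, and the random inputs are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """(n x k) @ (k x m) on plain nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_weights(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d))."""
    scale = math.sqrt(len(keys[0]))
    scores = matmul(queries, transpose(keys))
    return [softmax([s / scale for s in row]) for row in scores]


def precompute_candidate_link_weights(candidates, links):
    """Offline step (assumed): depends only on candidates T and links L, not on any user."""
    return attention_weights(candidates, links)


def personalize_links(links, history):
    """Online, once per user (assumed): links attend over the history to give L^P."""
    return matmul(attention_weights(links, history), history)


def score_candidates(cand_link_weights, personalized_links):
    """Online, per request: (C x num_links) @ (num_links x d); no pass over the history."""
    return matmul(cand_link_weights, personalized_links)


def rand(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d, num_links, n_history, n_candidates = 4, 3, 10, 5
links = rand(num_links, d)            # shared link embeddings L
history = rand(n_history, d)          # one user's interaction embeddings
candidates = rand(n_candidates, d)    # candidate item embeddings T

weights = precompute_candidate_link_weights(candidates, links)  # reusable across users
link_p = personalize_links(links, history)
user_aware = score_candidates(weights, link_p)                  # C x d user-aware outputs
```

In this sketch the only per-request work that touches the candidates is a product with a `num_links`-wide matrix, so the history length and the candidate count never appear in the same matrix product. Whether LIME normalizes, masks, or stacks these steps differently is not stated in the text above.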
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- Novel architectural design that effectively resolves the efficiency–expressiveness trade-off in large-scale recommendation.
- Clear technical innovations in decoupling attention (LIME-MHA) and introducing efficient XOR masking (LIME-XOR).
- Detailed experimental verification across public and industrial datasets, with both offline and online A/B test results provided.
- Demonstrated scaling properties and robustness in latency under very large candidate sets and long user histories.
If I haven't misunderstood, the LIME-MHA part replaces the attention mechanism with a well-designed matrix multiplication between the user and candidate parts: the product of one matrix, TL^t (independent of the user's item-side information), and a second, L^P (user-side personalized information). This approach will therefore significantly improve inference speed. However, based on the results of this article, its performance is also very st
- The paper simultaneously resolves two bottlenecks: candidate set scaling (via pre-computed attention weights) and long user history modeling (via linear XOR attention).
- Near-SOTA Accuracy with Low Latency: LIME-XOR closes the performance gap with HSTU across tasks (video completion, watch time) while maintaining latency that is orders of magnitude lower.
- Practical Deployment Value: Its decoupled inference pipeline (offline item pre-computation + online user processing) aligns with real-worl
- Link Count Sensitivity: Performance scales with link count, but small link counts (e.g., 8–16) underperform, and large ones may offset the efficiency gains.
- Limited Shorter Sequence Advantage: On datasets with short user histories (e.g., Taobao Ads, max length 50), LIME-XOR performs only slightly better than LIME-MHA and lags HSTU. Its linear attention gains seem less impactful for small N.
- Cold-Start Unaddressed: The framework relies on historical user interactions to personalize link embeddings,
The paper provides a figure immediately after the abstract to showcase its empirical results. The paper is well written, with a fluent narrative. It targets the quadratic complexity of MHA, which is a significant problem, especially in RecSys. Online A/B test results are provided.
According to Figure 1, MHA outperforms the proposed LIME for user sequence lengths $N<4096$, and similarly outperforms LIME-XOR for $N<8192$. This implies that LIME provides limited acceleration benefits within these ranges, which are typical in real-world recommender systems. Since users with longer behavioral sequences (>4096 or 8192) constitute a small fraction, the practical advantage of LIME in terms of acceleration appears limited under realistic conditions. An ablation study is essential
