L$^3$: Large Lookup Layers

Albert Tseng; Christopher De Sa

arXiv:2601.21461·cs.LG·February 2, 2026

TL;DR

L$^3$ introduces a new sparse embedding layer for transformers that uses static token-based routing, improving efficiency and performance over traditional dense and MoE models in language tasks.

Contribution

The paper presents L$^3$, a novel large lookup layer that generalizes embedding tables with static routing, enabling efficient, context-dependent token representations in large language models.

Findings

01. L$^3$ outperforms dense models in language modeling tasks.

02. L$^3$ surpasses MoE models in downstream task performance.

03. The architecture enables fast training and inference with no overhead.

Abstract

Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP "experts." However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L$^3$), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L$^3$ layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L$^3$ has two main…
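The mechanism named in the abstract (static token-based routing plus context-dependent aggregation) can be illustrated with a small sketch. This is not the authors' implementation: the key/value table layout, the softmax mixing, the fixed number of slots per token, and the names `make_l3_layer` and `l3_forward` are all assumptions made for illustration. The sketch only shows the general idea that the token id alone decides which embeddings are read, while the hidden state decides how they are weighted.

```python
import math
import random


def make_l3_layer(vocab_size, slots_per_token, d_model, seed=0):
    """Build hypothetical lookup tables.

    Assumption: every token id owns a fixed block of `slots_per_token`
    key rows and value rows. The paper may allocate rows differently.
    """
    rng = random.Random(seed)

    def rand_vec():
        return [rng.gauss(0.0, 0.02) for _ in range(d_model)]

    keys = [[rand_vec() for _ in range(slots_per_token)] for _ in range(vocab_size)]
    values = [[rand_vec() for _ in range(slots_per_token)] for _ in range(vocab_size)]
    return keys, values


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l3_forward(token_id, hidden, keys, values):
    """Aggregate one token's embeddings using its hidden state.

    Static routing: `token_id` alone selects which rows are read.
    Context dependence: `hidden` only sets the mixing weights.
    """
    k_rows = keys[token_id]
    v_rows = values[token_id]
    # Score each of the token's key rows against the hidden state.
    scores = [sum(h * k for h, k in zip(hidden, row)) for row in k_rows]
    weights = softmax(scores)
    # Weighted sum of the token's value rows (assumed aggregation rule).
    mixed = [
        sum(w * row[j] for w, row in zip(weights, v_rows))
        for j in range(len(hidden))
    ]
    return mixed, weights


keys, values = make_l3_layer(vocab_size=5, slots_per_token=3, d_model=4)
output, weights = l3_forward(2, [1.0, 0.0, 0.0, 0.0], keys, values)
```

Because the rows read for a token depend only on its id, they are known before the forward pass runs, which is the property that distinguishes this kind of lookup from dynamic MoE routing. Under the assumptions above, changing the rows that belong to other tokens leaves the output for token 2 unchanged.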

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Stochastic Gradient Optimization Techniques