L$^3$: Large Lookup Layers
Albert Tseng, Christopher De Sa

TL;DR
L$^3$ introduces a new sparse embedding layer for transformers that uses static token-based routing, improving efficiency and performance over traditional dense and MoE models in language tasks.
Contribution
The paper presents L$^3$, a novel large lookup layer that generalizes embedding tables with static routing, enabling efficient, context-dependent token representations in large language models.
Findings
L$^3$ outperforms dense models in language modeling tasks.
L$^3$ surpasses MoE models in downstream task performance.
The architecture enables fast training and inference with no overhead.
Abstract
Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP "experts." However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L$^3$), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L$^3$ layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L$^3$ has two main…
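To make the idea of static token-based routing with context-dependent aggregation more concrete, here is a minimal sketch in plain Python. It is one possible reading of the abstract, not the authors' implementation: the class name `LookupLayerSketch`, the separate key/value tables, the fixed number of slots per token, and the softmax weighting are all assumptions made for illustration.

```python
import math
import random


def dot(x, y):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LookupLayerSketch:
    """Hypothetical illustration only; not the layer defined in the paper."""

    def __init__(self, vocab_size, slots_per_token, dim, seed=0):
        rng = random.Random(seed)
        self.k = slots_per_token
        self.dim = dim
        n_rows = vocab_size * slots_per_token
        # Assumed layout: token t statically owns rows [t*k, (t+1)*k).
        self.keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_rows)]
        self.values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_rows)]

    def forward(self, token_id, hidden):
        # Static routing: the row set depends only on the token id.
        start = token_id * self.k
        rows = list(range(start, start + self.k))
        # Context dependence: the hidden state decides how the rows are mixed.
        scores = [dot(hidden, self.keys[r]) / math.sqrt(self.dim) for r in rows]
        weights = softmax(scores)
        out = [
            sum(w * self.values[r][j] for w, r in zip(weights, rows))
            for j in range(self.dim)
        ]
        return out, weights


layer = LookupLayerSketch(vocab_size=4, slots_per_token=3, dim=2)
# Same token, two different contexts: same rows, different mixing weights.
out_a, w_a = layer.forward(1, [1.0, 0.0])
out_b, w_b = layer.forward(1, [0.0, 1.0])
```

In this sketch no router is learned and no auxiliary load-balancing loss is needed, because the rows a token touches are fixed ahead of time; only the mixing weights depend on context.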
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Stochastic Gradient Optimization Techniques
