STEM: Scaling Transformers with Embedding Modules
Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen

TL;DR
STEM introduces a novel static, token-indexed approach to scaling transformers by replacing the FFN up-projection with embedding lookups, leading to improved training stability, efficiency, and interpretability, especially at large scales.
Contribution
STEM proposes a new method that decouples capacity from FLOPs and communication, enabling stable training with extreme sparsity and enhanced interpretability in transformer models.
Findings
Improves downstream performance over dense baselines.
Reduces per-token FLOPs and parameter accesses by about one-third.
Delivers 3-4% accuracy improvements on large-scale benchmarks.
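The "about one-third" figure follows from simple counting, sketched below under stated assumptions. A SwiGLU FFN has three equally sized projections (gate, up, down), and STEM replaces the up-projection with a lookup. The function names and the sizes `d_model=2048`, `d_ff=8192` are hypothetical, the looked-up row is assumed to have width `d_ff`, and elementwise operations (the activation and the gating product) are ignored.

```python
def dense_swiglu_cost(d_model, d_ff):
    """Per-token weights read and matmul FLOPs for a dense SwiGLU FFN."""
    params = 3 * d_model * d_ff   # gate, up, and down projections
    flops = 2 * params            # one multiply and one add per weight
    return params, flops


def stem_cost(d_model, d_ff):
    """Same count when the up-projection is replaced by a table lookup.

    Assumption: the looked-up embedding row has d_ff entries and costs
    no matmul FLOPs; only the gate and down projections remain dense.
    """
    params = 2 * d_model * d_ff + d_ff   # two matrices plus one table row
    flops = 2 * (2 * d_model * d_ff)     # gate and down matmuls only
    return params, flops


# Hypothetical sizes, chosen only for illustration.
d_model, d_ff = 2048, 8192
dense_params, dense_flops = dense_swiglu_cost(d_model, d_ff)
stem_params, stem_flops = stem_cost(d_model, d_ff)

flop_saving = 1 - stem_flops / dense_flops
param_saving = 1 - stem_params / dense_params
print(f"FFN matmul FLOPs saved per token: {flop_saving:.4f}")
print(f"FFN parameters read saved per token: {param_saving:.4f}")
```

Under these assumptions the FLOP saving is exactly one-third of the FFN matmul cost, and the saving in parameters read is slightly less than one-third because one table row is still fetched per token. Attention and other layers are outside this count.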
Abstract
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity.
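The substitution described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a SwiGLU FFN of the form `down(silu(gate(x)) * up(x))`, and the names `stem_ffn` and `stem_table` as well as all sizes are hypothetical. Details such as normalization, initialization, and which layers use STEM are not given on this page and are left out.

```python
import numpy as np


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def dense_swiglu_ffn(x, w_gate, w_up, w_down):
    """Dense baseline: the up branch is a matmul on the hidden state."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def stem_ffn(x, token_ids, w_gate, stem_table, w_down):
    """STEM-style sketch: the up branch is a row lookup by token id.

    stem_table is a hypothetical per-layer table of shape (vocab, d_ff).
    The lookup depends only on the input token id, not on the hidden
    state x, so there is no runtime routing decision.
    """
    up = stem_table[token_ids]               # (seq_len, d_ff)
    return (silu(x @ w_gate) * up) @ w_down  # gate and down stay dense


# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
vocab_size, d_model, d_ff, seq_len = 100, 16, 64, 5

x = rng.standard_normal((seq_len, d_model))
token_ids = np.array([3, 17, 3, 42, 99])
w_gate = rng.standard_normal((d_model, d_ff)) * 0.1
w_up = rng.standard_normal((d_model, d_ff)) * 0.1
w_down = rng.standard_normal((d_ff, d_model)) * 0.1
stem_table = rng.standard_normal((vocab_size, d_ff)) * 0.1

out = stem_ffn(x, token_ids, w_gate, stem_table, w_down)
print(out.shape)
```

Because the table rows are indexed by token id, the rows needed for a sequence are known as soon as the input tokens are known, which is what makes prefetching from CPU memory plausible. Positions that share a token id (positions 0 and 2 above) read the same row, while the dense gate branch still depends on the contextual hidden state.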
Peer Reviews
Decision: ICLR 2026 Poster
- The paper proposes an interesting idea to complement the transformer architecture. The authors are aware of the many practical challenges surrounding today’s transformer architecture and take them into account when designing STEM.
- The proposed method can be applied in stages by applying it to only some of the layers of a transformer. This allows for a smoother transition away from the standard architecture and for greater flexibility in architecture design.
- The paper per
- Although the paper correctly accounts for reduced FLOPs, it can often be quite challenging to realize the full benefit of a FLOP reduction in practice. The paper offers a theoretical discussion of how the embedding table reads can be prefetched and offloaded onto CPU, which makes sense, but the case would have been stronger if wall-clock training, prefill, and generation time comparisons had been provided along with HBM usage.
- The concern about the hit to contextual reasoning abilities (since we
1. Novel Architecture Design. A creative and simple static-sparsity alternative to MoE models that avoids routing overhead and load-balancing complications.
2. Strong Empirical Validation. Comprehensive experiments across scales (350M, 1B) and tasks demonstrate consistent benefits in both efficiency and accuracy.
3. Training Stability. Unlike many fine-grained sparse models, STEM avoids loss spikes and under-trained experts.
1. Potential Memory Overhead. Each layer’s token-indexed embedding table may become impractical for large vocabularies, despite CPU offload.
2. Lack of Ablation on Embedding Dimensionality. The effect of embedding size or vocabulary size on performance and stability is unexplored.
STEM simplifies operation compared to MoE by replacing the dense up-projection in the SwiGLU FFN with a token-indexed vector from a per-layer table lookup. STEM improves both computation, by reducing per-layer FLOPs during training, and memory access, by lowering parameter traffic relative to a dense up-projection. Experiments are informative, with sufficient ablations. The paper does a good job at describing the various aspects of the STEM technique and in pointing out the benefits of the approach relative t
The paper is dense and focused. It uses a lot of jargon and will not be very accessible to readers unfamiliar with the ideas and issues specific to this narrow research area. This is not necessarily a weakness, but readers who are not in this area may not appreciate the paper's purpose or contributions. The paper would be better with some diagrams to illustrate the architectural differences between STEM and other approaches, including MoE techniques. Figure 1.c is too small and unclear for this pur
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science
