Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
Abdalrahman Wael

TL;DR
This study compares dense and mixture-of-experts (MoE) transformers at tiny scale under two matching schemes: active-parameter and total-parameter. The MoE beats a dense baseline matched on active parameters but does not surpass a dense baseline matched on total parameters.
Contribution
It provides a controlled comparison of dense and MoE transformers at tiny scale and shows that the outcome depends on whether the dense baseline is matched on active or total parameter count.
Findings
The MoE outperforms the dense baseline under active-parameter matching.
The dense baseline is slightly better under total-parameter matching.
In best validation loss, the matched-active gap (0.0758 +/- 0.0021) is roughly four times the matched-total gap (0.0180 +/- 0.0020).
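The difference between the two matching schemes can be made concrete with a small parameter count. The sketch below is illustrative only: it assumes bias-free SwiGLU feed-forward blocks and a linear router, uses hypothetical layer sizes, and resizes only the feed-forward hidden width of the dense baseline. The paper's actual dimensions and its exact width-resizing rule are not stated in this summary.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one SwiGLU feed-forward block (gate, up, down; no biases)."""
    return 3 * d_model * d_ff


def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total, active) feed-forward parameters for one MoE layer."""
    router = d_model * n_experts          # assumed linear router, no bias
    expert = ffn_params(d_model, d_ff)
    total = router + n_experts * expert   # every expert is stored
    active = router + top_k * expert      # only top_k experts run per token
    return total, active


# Hypothetical sizes, chosen only for illustration.
d_model, d_ff = 256, 512
total, active = moe_ffn_params(d_model, d_ff, n_experts=4, top_k=2)

# Dense hidden widths whose feed-forward block fits each budget.
d_ff_active_match = active // (3 * d_model)
d_ff_total_match = total // (3 * d_model)

print(f"MoE layer: total={total}, active={active}")
print(f"dense d_ff for active match: {d_ff_active_match}")
print(f"dense d_ff for total match:  {d_ff_total_match}")
```

With four experts and top-2 routing, the total-match dense block is about twice as wide as the active-match one, which is why the two baselines can lead to different conclusions.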
Abstract
We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE's favor and a matched-total gap of 0.0180 +/- 0.0020 in the…
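The two auxiliary router terms named in the abstract can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's implementation: it assumes the load-balancing form from the Switch Transformer (number of experts times the sum of dispatch fraction times mean router probability) and the z-loss form from ST-MoE (mean squared log-sum-exp of the router logits). Counting dispatch fractions over all top-k routing slots is one common generalization and is an assumption here, as are the function names and the omission of loss coefficients.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def logsumexp(row):
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))


def router_losses(logits, top_k=2):
    """logits: one list of per-expert router logits for each token.

    Returns (load_balance, z_loss), both unscaled.
    """
    n_tokens = len(logits)
    n_experts = len(logits[0])
    probs = [softmax(row) for row in logits]

    # f[i]: fraction of the n_tokens * top_k routing slots given to expert i.
    counts = [0] * n_experts
    for p in probs:
        top = sorted(range(n_experts), key=lambda i: p[i], reverse=True)[:top_k]
        for i in top:
            counts[i] += 1
    f = [c / (n_tokens * top_k) for c in counts]

    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]

    load_balance = n_experts * sum(fi * pi for fi, pi in zip(f, P))
    z_loss = sum(logsumexp(row) ** 2 for row in logits) / n_tokens
    return load_balance, z_loss


# Two tokens that prefer disjoint expert pairs: dispatch is balanced.
balanced = router_losses([[2.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 1.0]])
# Two tokens that both prefer experts 0 and 1: dispatch has collapsed.
collapsed = router_losses([[5.0, 5.0, 0.0, 0.0], [5.0, 5.0, 0.0, 0.0]])
print("balanced :", balanced)
print("collapsed:", collapsed)
```

Under this formulation the load-balancing term equals 1 when dispatch is uniform and grows as routing concentrates on a few experts, while the z-loss penalizes large router logits regardless of how balanced they are.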
