BASE Layers: Simplifying Training of Large, Sparse Models

Mike Lewis; Shruti Bhosale; Tim Dettmers; Naman Goyal; Luke Zettlemoyer

arXiv:2103.16716 · cs.CL · April 1, 2021 · 63 cites

TL;DR

This paper presents BASE layers, a novel approach for large sparse models that formulates token-to-expert routing as a linear assignment problem, ensuring balanced expert utilization without extra hyperparameters.
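As a sketch of what "linear assignment" means here, the routing problem can be written as the constrained optimization below. The symbols are illustrative assumptions, not notation taken from this page: T tokens, E experts (with T divisible by E), s_{t,e} a score for sending token t to expert e, and a_t the expert chosen for token t.

```latex
\max_{a_1,\dots,a_T} \; \sum_{t=1}^{T} s_{t,a_t}
\quad \text{subject to} \quad
\sum_{t=1}^{T} \mathbb{1}[a_t = e] = \frac{T}{E}
\quad \text{for all } e \in \{1,\dots,E\}
```

The constraint is what enforces balance: every expert receives exactly T/E tokens, so no separate balancing term has to be tuned.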

Contribution

The paper introduces a balanced assignment scheme for sparse layers in large models. Guaranteed balanced compute loads improve efficiency, and training is simpler because no new hyperparameters or auxiliary balancing losses are needed.

Findings

01 Achieves balanced expert utilization without additional hyperparameters.

02 Simplifies training of sparse models by using linear assignment for routing.

03 Code is publicly available for reproducibility.

Abstract

We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary…
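The balanced routing idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the fairseq implementation: it assumes the token-to-expert scores are given as a plain list of lists, that the number of tokens is divisible by the number of experts, and it solves the assignment with a textbook Hungarian algorithm chosen for brevity. The names `solve_assignment`, `balanced_assign`, and `greedy_assign` are hypothetical.

```python
def solve_assignment(cost):
    """Textbook Hungarian algorithm (potentials version), minimizing total cost.

    cost is an n x m matrix with n <= m. Returns, for each row, the index of
    the column assigned to it.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0] * (n + 1)      # row potentials (1-indexed)
    v = [0] * (m + 1)      # column potentials (1-indexed)
    p = [0] * (m + 1)      # p[j] = row currently matched to column j (0 = none)
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to the root.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    result = [0] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            result[p[j] - 1] = j - 1
    return result


def balanced_assign(scores, num_experts):
    """Assign each token to one expert so that every expert gets the same
    number of tokens and the total score is maximal."""
    num_tokens = len(scores)
    if num_tokens % num_experts != 0:
        raise ValueError("number of tokens must be divisible by number of experts")
    capacity = num_tokens // num_experts
    # Give each expert `capacity` identical slots; negate scores to maximize.
    cost = [[-scores[t][slot // capacity] for slot in range(num_tokens)]
            for t in range(num_tokens)]
    slots = solve_assignment(cost)
    return [slot // capacity for slot in slots]


def greedy_assign(scores):
    """Unbalanced baseline: each token simply picks its highest-scoring expert."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


if __name__ == "__main__":
    # Four tokens, two experts; every token prefers expert 0.
    scores = [[9, 1], [8, 2], [7, 6], [6, 5]]
    print(greedy_assign(scores))       # [0, 0, 0, 0]
    print(balanced_assign(scores, 2))  # [0, 0, 1, 1]
```

In the toy example, greedy routing sends every token to expert 0, while the balanced assignment moves the two tokens that lose the least score to expert 1. At realistic batch sizes a cubic-time solver like this one would likely be too slow, so treat it only as a statement of the objective.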

Code & Models

Repositories

pytorch/fairseq · pytorch · Official

Videos

BASE Layers: Simplifying Training of Large, Sparse Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms