Hash Layers For Large Sparse Models

Stephen Roller; Sainbayar Sukhbaatar; Arthur Szlam; Jason Weston

arXiv:2106.04426 · cs.LG · July 21, 2021 · 48 cites

TL;DR

This paper introduces a hashing-based sparse layer for large Transformer models that outperforms or is competitive with learning-to-route mixture-of-expert methods, while requiring no routing parameters, no load balancing loss, and no sophisticated assignment algorithm.

Contribution

It proposes replacing the Transformer feedforward layer with one that hashes each token to one of several sets of weights. This removes the need for routing parameters, extra terms in the objective function, and an assignment algorithm, while keeping performance competitive.

Findings

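01 Hashing-based sparse layers outperform or are competitive with learning-to-route mixture-of-expert methods such as Switch Transformers and BASE Layers.

02 Balanced and random hashes focused on the most local features work best, compared with learned clusters or longer-range context.

03 The approach is effective on language modeling, dialogue, and downstream tasks.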

Abstract

We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods such as Switch Transformers and BASE Layers, while requiring no routing parameters or extra terms in the objective function such as a load balancing loss, and no sophisticated assignment algorithm. We study the performance of different hashing techniques, hash sizes and input features, and show that balanced and random hashes focused on the most local features work best, compared to either learning clusters or using longer-range context. We show our approach works well both on large…
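The routing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `HashFFN`, the parameter names, the ReLU activation, and the initialization are all hypothetical choices made for illustration. The hash here is a fixed random lookup table that gives each expert an equal share of vocabulary entries, which is only one simple way to read "balanced and random"; it does not account for how often each token occurs in the training data.

```python
import numpy as np


class HashFFN:
    """Feedforward layer whose weights are picked by a fixed hash of the token ID.

    Illustrative sketch only; names and details are assumptions, not the paper's code.
    """

    def __init__(self, vocab_size, d_model, d_hidden, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed, non-learned hash: assign token IDs to experts round-robin,
        # then shuffle, so every expert owns about vocab_size / num_experts IDs.
        # (Assumption: balance is by vocabulary count, not by token frequency.)
        self.token_to_expert = rng.permutation(np.arange(vocab_size) % num_experts)
        scale = 1.0 / np.sqrt(d_model)
        # One pair of weight matrices per expert.
        self.w1 = rng.normal(0.0, scale, (num_experts, d_model, d_hidden))
        self.w2 = rng.normal(0.0, scale, (num_experts, d_hidden, d_model))

    def __call__(self, token_ids, hidden):
        # token_ids: (seq_len,) integer array; hidden: (seq_len, d_model) float array.
        experts = self.token_to_expert[token_ids]
        out = np.empty_like(hidden)
        for e in np.unique(experts):
            mask = experts == e
            h = np.maximum(hidden[mask] @ self.w1[e], 0.0)  # ReLU
            out[mask] = h @ self.w2[e]
        return out


layer = HashFFN(vocab_size=100, d_model=8, d_hidden=16, num_experts=4)
token_ids = np.array([3, 7, 3, 42])
hidden = np.random.default_rng(1).normal(size=(4, 8))
output = layer(token_ids, hidden)
```

Because the lookup table is fixed before training, the sketch has no router weights to learn and no balancing term to add to the loss; the expert for a position depends only on the token at that position.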

Videos

Hash Layers For Large Sparse Models · slideslive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Residual Connection · Dense Connections · Softmax · Dropout