Hash Layers For Large Sparse Models
Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston

TL;DR
This paper introduces a hashing-based sparse layer for large Transformer models that matches or outperforms learned-routing mixture-of-experts methods while requiring no routing parameters, no load balancing loss, and no sophisticated assignment algorithm.
Contribution
It modifies the Transformer feedforward layer so that each token is hashed to one of several sets of weights, which removes the need for routing parameters and assignment algorithms while keeping performance competitive.
Findings
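Hashing-based sparse layers outperform or are competitive with learning-to-route mixture-of-experts methods such as Switch Transformers and BASE Layers.
Balanced and random hashes focused on the most local features work best, compared with learned clusters or longer-range context.
The approach is effective on language modeling, dialogue, and downstream tasks.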
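To make the difference between a random and a balanced hash concrete, here is a minimal sketch in plain Python. The function names, the toy Zipf-like token counts, and the greedy balancing rule (most frequent tokens first, each sent to the currently least-loaded expert) are assumptions chosen for illustration; the summary above does not specify how the balanced hash is built.

```python
import random


def random_hash(vocab_size, num_experts, seed=0):
    """Fixed random lookup table: token id -> expert id (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(num_experts) for _ in range(vocab_size)]


def balanced_hash(token_counts, num_experts):
    """Greedy balancing (an assumed rule, for illustration only).

    Tokens are visited from most to least frequent, and each one is
    assigned to the expert with the smallest accumulated count so far.
    """
    load = [0] * num_experts
    table = [0] * len(token_counts)
    order = sorted(range(len(token_counts)), key=lambda t: -token_counts[t])
    for tok in order:
        expert = min(range(num_experts), key=lambda e: load[e])
        table[tok] = expert
        load[expert] += token_counts[tok]
    return table, load


# Toy Zipf-like frequencies for a 50-token vocabulary (made-up data).
token_counts = [1000 // (rank + 1) for rank in range(50)]

rand_table = random_hash(len(token_counts), num_experts=4)
bal_table, bal_load = balanced_hash(token_counts, num_experts=4)

# Per-expert token mass under the purely random table, for comparison.
rand_load = [0] * 4
for tok, expert in enumerate(rand_table):
    rand_load[expert] += token_counts[tok]

print("random hash load:  ", rand_load)
print("balanced hash load:", bal_load)
```

Both tables are fixed before training and contain no learned parameters; the only difference is how evenly the token mass is spread across experts.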
Abstract
We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods such as Switch Transformers and BASE Layers, while requiring no routing parameters or extra terms in the objective function such as a load balancing loss, and no sophisticated assignment algorithm. We study the performance of different hashing techniques, hash sizes and input features, and show that balanced and random hashes focused on the most local features work best, compared to either learning clusters or using longer-range context. We show our approach works well both on large…
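The core idea in the abstract, a feedforward layer whose weights are selected by hashing the current token, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the names `make_expert`, `expert_ffn`, and `hash_layer`, the ReLU activation, the tiny dimensions, and the random lookup table are all hypothetical, and the residual connection and normalization of a real Transformer block are omitted. Plain Python lists are used so the block runs without any dependencies.

```python
import random


def make_expert(d_model, d_ff, rng):
    """One set of feedforward weights (w1: d_model x d_ff, w2: d_ff x d_model)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    return w1, w2


def matvec(x, w):
    """Multiply a length-n vector by an n x m matrix, giving a length-m vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_ffn(x, expert):
    """Two-layer feedforward with ReLU (activation choice is an assumption)."""
    w1, w2 = expert
    hidden = [max(0.0, h) for h in matvec(x, w1)]
    return matvec(hidden, w2)


def hash_layer(token_ids, hidden_states, hash_table, experts):
    """Apply, at each position, the expert picked by hashing that position's token.

    The lookup table is fixed: it has no trainable routing parameters and
    needs no load balancing term in the loss.
    """
    outputs = []
    for tok, h in zip(token_ids, hidden_states):
        expert = experts[hash_table[tok]]
        outputs.append(expert_ffn(h, expert))
    return outputs


rng = random.Random(0)
vocab_size, num_experts, d_model, d_ff = 10, 3, 4, 8

# Fixed random hash from token id to expert id (illustrative choice).
hash_table = [rng.randrange(num_experts) for _ in range(vocab_size)]
experts = [make_expert(d_model, d_ff, rng) for _ in range(num_experts)]

token_ids = [3, 7, 3, 1]
hidden_states = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in token_ids]
outputs = hash_layer(token_ids, hidden_states, hash_table, experts)
```

Because the expert is a function of the token id alone, two positions holding the same token always use the same weights, whatever the surrounding context.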
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Residual Connection · Dense Connections · Softmax · Dropout
