The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers
Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

TL;DR
This paper shows that simple adjustments to Transformer configurations, such as embedding scaling and relative positional embeddings, substantially improve systematic generalization on five datasets (SCAN, CFQ, PCFG, COGS, and Mathematics). It also highlights the need for proper validation sets, because such improvements are often invisible on IID data splits.
Contribution
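The study demonstrates that basic model configuration changes (scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants) can drastically improve the systematic generalization of Transformers, and it emphasizes the need for proper validation strategies.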
Findings
Accuracy improves from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS.
Relative positional embedding yields 100% accuracy on the SCAN length split with a cutoff at 26.
Model improvements are often invisible on IID data splits.
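One intuition for why relative positions can help on a length split is that attention depends on the offset between two tokens rather than on their absolute indices, so offsets seen in short training sequences recur in longer test sequences. The following is a minimal sketch of that idea only. It assumes a clipped-offset scheme, and the function name `relative_offsets` and the `max_distance` parameter are hypothetical; it is not the paper's implementation.

```python
def relative_offsets(length, max_distance):
    """Return a length x length matrix of offsets (j - i), clipped to
    [-max_distance, max_distance]. Hypothetical helper for illustration."""
    return [
        [max(-max_distance, min(max_distance, j - i)) for j in range(length)]
        for i in range(length)
    ]


# Offsets for a short "training" sequence and a longer "test" sequence.
short = relative_offsets(3, max_distance=2)
longer = relative_offsets(5, max_distance=2)

seen_short = {offset for row in short for offset in row}
seen_longer = {offset for row in longer for offset in row}

print(short)
# Every offset needed by the longer sequence was already seen in the short one.
print(seen_longer <= seen_short)
```

With absolute positions, by contrast, a sequence longer than any seen in training would use position indices the model has never encountered.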
Abstract
Recently, many datasets have been proposed to test the systematic generalization ability of neural networks. The companion baseline Transformers, typically trained with default hyper-parameters from standard tasks, are shown to fail dramatically. Here we demonstrate that by revisiting model configurations as basic as scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants, we can drastically improve the performance of Transformers on systematic generalization. We report improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics dataset. Our models improve accuracy from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS. On SCAN, relative positional embedding largely mitigates the EOS decision problem (Newman et al., 2020), yielding 100% accuracy on the length split with a cutoff at 26. Importantly,…
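To make "scaling of embeddings" concrete, here is a minimal sketch that assumes the convention from the original Transformer: token embeddings are multiplied by sqrt(d_model) before sinusoidal positional encodings are added. The function names and the toy vector are hypothetical, and the specific scaling variants compared in the paper are not described in this summary.

```python
import math


def sinusoidal_position(pos, d_model):
    """Sinusoidal positional encoding for a single position:
    sin on even dimensions, cos on odd dimensions."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]


def embed(token_vector, pos, scale_tokens=True):
    """Combine a token embedding with its positional encoding.
    With scale_tokens=True the token embedding is multiplied by sqrt(d_model)."""
    d_model = len(token_vector)
    scale = math.sqrt(d_model) if scale_tokens else 1.0
    pe = sinusoidal_position(pos, d_model)
    return [scale * t + p for t, p in zip(token_vector, pe)]


token = [0.1, -0.2, 0.05, 0.3]  # toy embedding, d_model = 4 (assumed values)
print(embed(token, pos=0, scale_tokens=True))
print(embed(token, pos=0, scale_tokens=False))
```

The point of the sketch is that the scaling factor changes the relative magnitude of token and positional information, which is the kind of configuration detail the abstract refers to.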
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Dropout · Adam · Byte Pair Encoding
