The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers

Róbert Csordás; Kazuki Irie; Jürgen Schmidhuber

arXiv:2108.12284 · cs.LG · February 15, 2022 · 5 cites

TL;DR

This paper shows that simple changes to Transformer configurations, such as embedding scaling, early stopping, and relative positional embeddings, substantially improve systematic generalization on five datasets (SCAN, CFQ, PCFG, COGS, and Mathematics). It also highlights the importance of using a proper validation set.

Contribution

The study demonstrates that basic model configuration tweaks can drastically improve the systematic generalization of Transformers, and it emphasizes the need for proper validation strategies.

Findings

01

Transformer accuracy improves from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS.

02

Relative positional embedding achieves 100% accuracy on the SCAN length split with a cutoff at 26.

03

Model improvements are often invisible on IID data splits.
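
Finding 02 concerns relative positional embedding. As a rough illustration of why position information expressed as offsets can carry over to longer sequences, the sketch below builds a matrix of clipped relative offsets `j - i`. The function name `relative_position_matrix`, the `max_distance` parameter, and the clipping itself are assumptions made for this example; this summary does not specify which relative scheme the authors use.

```python
def relative_position_matrix(length, max_distance):
    """Return a length x length matrix whose entry [i][j] is the offset j - i,
    clipped to the range [-max_distance, max_distance].

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    return [
        [max(-max_distance, min(max_distance, j - i)) for j in range(length)]
        for i in range(length)
    ]


if __name__ == "__main__":
    short = relative_position_matrix(4, max_distance=2)
    longer = relative_position_matrix(6, max_distance=2)
    # The offsets seen in the short sequence reappear unchanged in the longer
    # one: the top-left 4 x 4 corner of `longer` equals `short`.
    corner = [row[:4] for row in longer[:4]]
    print(corner == short)
    for row in longer:
        print(row)
```

Because every entry depends only on `j - i`, a model that indexes embeddings by these offsets encounters no new index values when the sequence grows, whereas absolute position indices beyond the training length would be new to it.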

Abstract

Recently, many datasets have been proposed to test the systematic generalization ability of neural networks. The companion baseline Transformers, typically trained with default hyper-parameters from standard tasks, are shown to fail dramatically. Here we demonstrate that by revisiting model configurations as basic as scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants, we can drastically improve the performance of Transformers on systematic generalization. We report improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics dataset. Our models improve accuracy from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS. On SCAN, relative positional embedding largely mitigates the EOS decision problem (Newman et al., 2020), yielding 100% accuracy on the length split with a cutoff at 26. Importantly,…
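
The abstract lists scaling of embeddings among the configurations it revisits. To make the idea concrete, here is a minimal sketch of one well-known variant, the one from the original Transformer (Vaswani et al., 2017), in which token embeddings are multiplied by sqrt(d_model) before sinusoidal positional encodings are added. The helper names, the initialization standard deviation, and the choice of this particular variant are assumptions for illustration; the text above does not say which scheme the authors recommend.

```python
import math
import random


def init_embedding(vocab_size, d_model, std, seed=0):
    """Gaussian-initialized embedding table (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(d_model)]
            for _ in range(vocab_size)]


def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    enc = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def rms(vec):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def embed(table, token_id, pos, scale):
    """Scaled token embedding plus positional encoding."""
    d_model = len(table[token_id])
    pe = sinusoidal_position(pos, d_model)
    return [scale * w + p for w, p in zip(table[token_id], pe)]


def token_to_position_ratio(table, token_id, pos, scale):
    """Ratio of the scaled token vector's RMS to the positional encoding's RMS."""
    d_model = len(table[token_id])
    token_part = [scale * w for w in table[token_id]]
    return rms(token_part) / rms(sinusoidal_position(pos, d_model))


if __name__ == "__main__":
    d_model = 64
    # Assumed initialization: standard deviation of d_model ** -0.5.
    table = init_embedding(vocab_size=10, d_model=d_model, std=d_model ** -0.5)
    print(token_to_position_ratio(table, 3, 5, scale=1.0))
    print(token_to_position_ratio(table, 3, 5, scale=math.sqrt(d_model)))
```

With a small initialization, the unscaled token vector has a smaller magnitude than the positional encoding, and the scale factor changes that ratio. The relative weight of token and position information at the input is the quantity this kind of configuration choice controls.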

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Dropout · Adam · Byte Pair Encoding