Sequence Length is a Domain: Length-based Overfitting in Transformer Models

Dušan Variš; Ondřej Bojar

arXiv:2109.07276·cs.CL·January 4, 2022

TL;DR

This paper investigates length-based overfitting in Transformer models. Performance drops when the sequence lengths seen at test time differ from those in the training data, and the drop appears to stem from overfitting to the training length distribution rather than from input length as such.

Contribution

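The study identifies the mismatch between training and validation length distributions, combined with the tendency of neural networks to overfit to the training data, as a likely cause of degraded performance, and demonstrates its impact on sequence generation tasks.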

Findings

01 Performance drops on out-of-distribution sequence lengths

02 Overfitting to training length distributions causes generalization issues

03 Performance is linked to hypothesis length, not input length
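
One way to probe the third finding on your own system outputs is to average a per-sentence quality score within length buckets, once keyed by hypothesis length and once by source length, and compare the two breakdowns. The sketch below is a minimal illustration under stated assumptions: the function name `mean_score_by_length`, the bucket size, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_length(lengths, scores, bucket_size=10):
    """Average per-sentence scores within length buckets.

    Bucket b covers lengths in [b * bucket_size, (b + 1) * bucket_size).
    """
    buckets = defaultdict(list)
    for n, s in zip(lengths, scores):
        buckets[n // bucket_size].append(s)
    return {b: mean(v) for b, v in sorted(buckets.items())}

# Toy placeholder values (NOT results from the paper): token counts of the
# source sentences, token counts of the system hypotheses, and some
# per-sentence quality score.
src_lens = [5, 25, 7, 28]
hyp_lens = [6, 8, 24, 27]
scores = [1.0, 1.0, 0.5, 0.0]

by_hyp = mean_score_by_length(hyp_lens, scores)
by_src = mean_score_by_length(src_lens, scores)
print("by hypothesis length:", by_hyp)
print("by source length:    ", by_src)
```

If the score varies more sharply across hypothesis-length buckets than across source-length buckets, that would be consistent with the finding, though a real analysis would need far more data and a proper evaluation metric.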

Abstract

Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g. dropout, L2-regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known to struggle when generating very long sequences. For example, in machine translation, the neural-based systems perform worse on very long sequences when compared to the preceding phrase-based translation approaches (Koehn and Knowles, 2017). We present results which suggest that the issue might also be in the mismatch between the length distributions of the training and validation data combined with the aforementioned tendency of the neural networks to overfit to the training data. We…
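
The length-distribution mismatch described in the abstract can be checked on a dataset before training. The following is a minimal sketch, assuming whitespace tokenization and fixed-width length buckets; the helper names `length_histogram` and `uncovered_mass` are hypothetical and do not come from the paper or its code.

```python
from collections import Counter

def length_histogram(sentences, bucket_size=10):
    """Return the fraction of sentences that fall into each length bucket.

    Length is counted in whitespace-separated tokens (an assumption; a real
    pipeline would count subword units produced by its own tokenizer).
    """
    counts = Counter(len(s.split()) // bucket_size for s in sentences)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def uncovered_mass(train_hist, valid_hist):
    """Fraction of validation sentences whose length bucket never occurs in training."""
    return sum(p for b, p in valid_hist.items() if b not in train_hist)

# Toy data for illustration only.
train = ["a b c", "a b c d", "a b"]
valid = ["a b c", "a b c d e f g h i j k l"]

train_hist = length_histogram(train, bucket_size=5)
valid_hist = length_histogram(valid, bucket_size=5)
print(uncovered_mass(train_hist, valid_hist))
```

A large uncovered fraction indicates that the validation set contains lengths the model never saw during training, which is the situation in which the paper reports a performance drop.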

Code & Models

Repositories

LiamMaclean216/Pytorch-Transfomer
pytorch

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Softmax