Sequence Length is a Domain: Length-based Overfitting in Transformer Models
Dušan Variš, Ondřej Bojar

TL;DR
This paper investigates length-based overfitting in Transformer models. Performance drops when sequence lengths differ from those seen in training, and the drop is tied to overfitting to the training length distribution (the length of the generated hypothesis) rather than to the length of the input.
Contribution
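The study identifies the mismatch between training and validation length distributions, combined with the tendency of neural networks to overfit, as a likely factor in poor Transformer generalization, and demonstrates its impact on sequence generation tasks.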
Findings
Performance drops on out-of-distribution sequence lengths
Overfitting to training length distributions causes generalization issues
Performance is linked to hypothesis length, not input length
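The length mismatch named in these findings can be checked on a dataset before training. The following is a minimal sketch, not the authors' method: it groups sequences into length buckets and reports what fraction of validation sequences fall into buckets that never occur in the training data. The function names, the bucket size of 10, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def length_buckets(sequences, bucket_size=10):
    """Count sequences per length bucket.

    Bucket k covers lengths in [k * bucket_size, (k + 1) * bucket_size).
    """
    return Counter(len(seq) // bucket_size for seq in sequences)


def unseen_length_fraction(train, valid, bucket_size=10):
    """Fraction of validation sequences whose length bucket is absent from training."""
    if not valid:
        return 0.0
    seen = set(length_buckets(train, bucket_size))
    unseen = sum(1 for seq in valid if len(seq) // bucket_size not in seen)
    return unseen / len(valid)


# Toy token sequences (hypothetical data, for illustration only).
train = [["tok"] * n for n in (3, 5, 8, 12, 15)]
valid = [["tok"] * n for n in (4, 14, 25, 31)]

print(length_buckets(train))                 # Counter({0: 3, 1: 2})
print(unseen_length_fraction(train, valid))  # 0.5
```

A high fraction suggests that validation scores may partly reflect the length mismatch rather than translation or editing quality alone. Since the findings tie the drop to hypothesis length, the same check is arguably most informative when applied to target-side sequences.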
Abstract
Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g. dropout, L2-regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known to struggle when generating very long sequences. For example, in machine translation, the neural-based systems perform worse on very long sequences when compared to the preceding phrase-based translation approaches (Koehn and Knowles, 2017). We present results which suggest that the issue might also be in the mismatch between the length distributions of the training and validation data combined with the aforementioned tendency of the neural networks to overfit to the training data. We…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Softmax
