Regularizing Transformers With Deep Probabilistic Layers
Aurora Cobo Aguilera, Pablo Martínez Olmos, Antonio Artés-Rodríguez, Fernando Pérez-Cruz

TL;DR
This paper introduces a novel regularization method for Transformers by integrating deep probabilistic layers, specifically GMVAE, which enhances their ability to handle noisy data and improves language modeling metrics.
Contribution
It presents the first application of GMVAE as a regularizer in Transformer-based language models, improving their robustness and performance.
Findings
GMVAE regularization improves BLEU scores.
Enhanced ability to impute missing or noisy words.
Effective in both encoder-only and encoder-decoder models.
Abstract
Language models (LMs) have grown non-stop over the last decade, from sequence-to-sequence architectures to the state-of-the-art, fully attention-based Transformers. In this work, we demonstrate how including deep generative models within BERT yields more versatile models, able to impute missing or noisy words with richer text and even to improve the BLEU score. More precisely, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizer layer and show its effectiveness not only in Transformers but also in the most relevant encoder-decoder LMs: seq2seq with and without attention.
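To make the idea of a GMVAE acting as a regularizer more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified GMVAE whose negative ELBO, computed on a single hidden vector, is added to the language-modeling loss with a weight. The function names, the uniform cluster prior, the unit-variance reconstruction term, and the `weight` parameter are all illustrative assumptions that the abstract does not specify.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def kl_categorical_uniform(probs):
    """KL( q(y) || Uniform(K) ) for a categorical distribution over K clusters."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def gmvae_regularizer(hidden, recon, mu_q, var_q, cluster_probs,
                      prior_means, prior_vars):
    """Negative ELBO of a simplified GMVAE for one hidden vector.

    Hypothetical helper: assumes a unit-variance Gaussian likelihood
    (squared error up to a constant), one Gaussian prior per cluster,
    and a uniform prior over clusters.
    """
    # Reconstruction term: how well the GMVAE rebuilds the hidden state.
    recon_term = 0.5 * sum((h - r) ** 2 for h, r in zip(hidden, recon))
    # Expected KL between the latent posterior and each cluster's prior,
    # weighted by the posterior cluster probabilities.
    kl_z = sum(
        p * kl_diag_gaussians(mu_q, var_q, m, v)
        for p, m, v in zip(cluster_probs, prior_means, prior_vars)
    )
    # KL between the cluster posterior and the uniform cluster prior.
    kl_y = kl_categorical_uniform(cluster_probs)
    return recon_term + kl_z + kl_y


def regularized_loss(lm_loss, reg, weight=0.1):
    """Language-modeling loss plus a weighted GMVAE term (weight is assumed)."""
    return lm_loss + weight * reg
```

In this sketch the regularizer is zero only when the hidden state is reconstructed exactly, the latent posterior matches the cluster priors, and the cluster assignment is uniform. Any departure from those conditions adds a penalty to the language-modeling loss.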
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · WordPiece · Layer Normalization · Adam · Residual Connection · Weight Decay · Linear Warmup With Linear Decay · Attention Dropout
