Regularizing Transformers With Deep Probabilistic Layers

Aurora Cobo Aguilera; Pablo Martínez Olmos; Antonio Artés-Rodríguez; Fernando Pérez-Cruz

arXiv:2108.10764 · cs.CL · August 25, 2021

TL;DR

This paper introduces a regularization method for Transformers that integrates a deep probabilistic layer, specifically a Gaussian Mixture Variational Autoencoder (GMVAE). The regularized models are better able to impute missing or noisy words and can achieve higher BLEU scores.

Contribution

It presents the first application of GMVAE as a regularizer in Transformer-based language models, improving their robustness and performance.

Findings

1. GMVAE regularization improves BLEU scores.
2. Enhanced ability to impute missing or noisy words.
3. Effective in both encoder-only and encoder-decoder models.

Abstract

Language models (LMs) have grown non-stop over the last decade, from sequence-to-sequence architectures to the state-of-the-art, fully attention-based Transformers. In this work, we demonstrate how including deep generative models within BERT can yield more versatile models, able to impute missing or noisy words with richer text or even improve the BLEU score. More precisely, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizer layer and show its effectiveness not only in Transformers but also in the most relevant encoder-decoder-based LMs: seq2seq with and without attention.
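To make the idea of a "regularizer layer" more concrete, the following is a deliberately simplified sketch of how a GMVAE-style penalty might be added to a language-modeling loss. It is not the formulation used in the paper: the function names, the uniform prior over clusters, the fixed mixture components, the weight `beta`, and the omission of the reconstruction term are all assumptions made for illustration.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) for diagonal Gaussians."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def kl_categorical_uniform(probs):
    """KL( q(c) || Uniform(K) ) for cluster-assignment probabilities q(c)."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)

def gmvae_regularizer(mu_q, var_q, cluster_probs, prior_means, prior_vars):
    """Hypothetical GMVAE-style KL penalty for a single latent code.

    Assumption: the latent posterior is a diagonal Gaussian, each mixture
    component is a fixed diagonal Gaussian, and the cluster prior is uniform.
    """
    # Expected KL between the latent posterior and each mixture component,
    # weighted by the probability of belonging to that component.
    kl_z = sum(
        pc * kl_diag_gaussians(mu_q, var_q, mu_c, var_c)
        for pc, mu_c, var_c in zip(cluster_probs, prior_means, prior_vars)
    )
    # Penalty for cluster assignments that drift from the uniform prior.
    return kl_z + kl_categorical_uniform(cluster_probs)

def regularized_loss(task_loss, reg, beta=0.1):
    """Task loss (e.g. cross-entropy) plus the weighted penalty; beta is an assumed hyperparameter."""
    return task_loss + beta * reg

# Toy example: a 2-D latent code and two mixture components.
reg = gmvae_regularizer(
    mu_q=[0.0, 0.0], var_q=[1.0, 1.0],
    cluster_probs=[0.5, 0.5],
    prior_means=[[0.0, 0.0], [2.0, 0.0]],
    prior_vars=[[1.0, 1.0], [1.0, 1.0]],
)
total = regularized_loss(task_loss=2.0, reg=reg, beta=0.1)
```

In this toy setup the penalty grows as the latent code moves away from the mixture components it is assigned to, which is the sense in which such a layer can act as a regularizer on a hidden representation.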

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · WordPiece · Layer Normalization · Adam · Residual Connection · Weight Decay · Linear Warmup With Linear Decay · Attention Dropout