Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks
Jason Phang, Thibault Févry, Samuel R. Bowman

TL;DR
Supplementary training on intermediate supervised tasks improves pretrained sentence encoders such as BERT and ELMo on language understanding benchmarks beyond what pretraining alone achieves, especially when target-task data is limited.
Contribution
This paper demonstrates that supplementary training on intermediate labeled-data tasks improves the performance of sentence encoders beyond standard pretraining, achieving state-of-the-art results on GLUE as of 02/24/2019.
Findings
Achieved a GLUE score of 81.8 with BERT, a 1.4 point improvement over BERT without supplementary training and the state of the art as of 02/24/2019.
Supplementary training reduced variance across random restarts.
The benefits of supplementary training were particularly pronounced in data-constrained regimes.
Abstract
Pretraining sentence encoders with language modeling and related unsupervised tasks has recently been shown to be very effective for language understanding tasks. By supplementing language model-style pretraining with further training on data-rich supervised tasks, such as natural language inference, we obtain additional performance improvements on the GLUE benchmark. Applying supplementary training on BERT (Devlin et al., 2018), we attain a GLUE score of 81.8, the state of the art (as of 02/24/2019) and a 1.4 point improvement over BERT. We also observe reduced variance across random restarts in this setting. Our approach yields similar improvements when applied to ELMo (Peters et al., 2018a) and Radford et al. (2018)'s model. In addition, the benefits of supplementary training are particularly pronounced in data-constrained regimes, as we show in experiments with artificially limited…
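The staged procedure the abstract describes (start from a pretrained encoder, train further on a data-rich intermediate task, then fine-tune on the target task) can be illustrated with a toy sketch. This is not the authors' implementation: the tiny tanh encoder, the logistic heads, the synthetic two-feature tasks, and every function name below are hypothetical stand-ins, and the choice to discard the intermediate task head before target fine-tuning is an assumption made for the illustration.

```python
import math
import random

def init_encoder(d_in, d_hid, rng):
    # Stand-in for a pretrained sentence encoder: one small weight matrix.
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_hid)]

def init_head(d_hid, rng):
    # Fresh task-specific output layer (binary classifier weights).
    return [rng.uniform(-0.1, 0.1) for _ in range(d_hid)]

def encode(enc, x):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in enc]

def predict(enc, head, x):
    z = sum(w * hi for w, hi in zip(head, encode(enc, x)))
    return 1.0 / (1.0 + math.exp(-z))

def train(enc, head, data, epochs, lr):
    """Plain SGD on binary cross-entropy; updates encoder and head in place."""
    for _ in range(epochs):
        for x, y in data:
            h = encode(enc, x)
            g = predict(enc, head, x) - y  # gradient of the loss w.r.t. the logit
            for j in range(len(head)):
                # Backpropagate through tanh using the head weight before its update.
                g_hidden = g * head[j] * (1.0 - h[j] ** 2)
                head[j] -= lr * g * h[j]
                for i in range(len(x)):
                    enc[j][i] -= lr * g_hidden * x[i]

def stilts(pretrained_enc, intermediate_data, target_data, rng, epochs=20, lr=0.1):
    enc = [row[:] for row in pretrained_enc]  # copy; the pretrained weights stay intact
    # Supplementary training on the data-rich intermediate task.
    train(enc, init_head(len(enc), rng), intermediate_data, epochs, lr)
    # Assumption: the intermediate head is dropped and a new head is
    # trained together with the encoder on the (smaller) target task.
    target_head = init_head(len(enc), rng)
    train(enc, target_head, target_data, epochs, lr)
    return enc, target_head

def toy_task(n, rule, rng):
    """Synthetic 2-feature binary task; `rule` maps features to a True/False label."""
    xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    return [(x, float(rule(x))) for x in xs]

rng = random.Random(0)
pretrained = init_encoder(2, 4, rng)
intermediate = toy_task(200, lambda x: x[0] > 0, rng)      # large intermediate task
target = toy_task(20, lambda x: x[0] + x[1] > 0, rng)      # small target task
enc, head = stilts(pretrained, intermediate, target, rng)
print(predict(enc, head, [0.5, 0.5]))
```

The only point of the sketch is the ordering: the encoder weights carry over from one stage to the next, while each task gets its own output layer. The synthetic data says nothing about the reported GLUE results.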
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Sigmoid Activation · Tanh Activation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam
