Distilling Knowledge Learned in BERT for Text Generation

Yen-Chun Chen; Zhe Gan; Yu Cheng; Jingzhou Liu; Jingjing Liu

arXiv:1911.03829 · cs.CL · July 21, 2020 · 21 cites

TL;DR

This paper introduces Conditional Masked Language Modeling (C-MLM) to finetune BERT on generation tasks, then distills the finetuned BERT into Seq2Seq models, improving translation and summarization and reaching state-of-the-art results on IWSLT German-English translation.

Contribution

The paper proposes a new distillation approach that leverages BERT's bidirectional features to enhance autoregressive sequence models for text generation.

Findings

01. Significant improvements over Transformer baselines.
02. State-of-the-art results on IWSLT German-English translation.
03. Effective global sequence-level supervision for coherent text generation.

Abstract

Large-scale pre-trained language models such as BERT have achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable the finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is exploited as extra supervision to improve conventional Seq2Seq models (student) for better text generation performance. By leveraging BERT's idiosyncratic bidirectional nature, distilling knowledge learned in BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text…
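
The teacher-student setup described in the abstract can be illustrated with a minimal sketch of a per-token distillation loss. This is not the authors' implementation: it assumes the common formulation in which the student's loss interpolates between cross-entropy against the teacher's soft distribution and cross-entropy against the gold token. The names `log_softmax` and `distillation_loss`, the mixing weight `alpha`, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def distillation_loss(student_logits, teacher_probs, gold_index, alpha=0.5):
    """Per-token loss mixing soft (teacher) and hard (gold) targets.

    Assumed form: alpha * CE(teacher, student) + (1 - alpha) * CE(gold, student).
    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    log_probs = log_softmax(student_logits)
    # Cross-entropy of the student against the teacher's soft distribution.
    soft_loss = -sum(t * lp for t, lp in zip(teacher_probs, log_probs))
    # Standard cross-entropy against the gold token.
    hard_loss = -log_probs[gold_index]
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


# Toy 3-word vocabulary: the teacher spreads probability mass over
# plausible alternatives instead of a single one-hot label.
student_logits = [2.0, 1.0, 0.1]
teacher_probs = [0.7, 0.2, 0.1]
print(distillation_loss(student_logits, teacher_probs, gold_index=0, alpha=0.5))
```

In a full Seq2Seq model this loss would be averaged over all target positions, with `teacher_probs` produced at each position by the finetuned BERT; setting `alpha` to 0 recovers ordinary cross-entropy training.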

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Sigmoid Activation · Tanh Activation · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding