Incorporating BERT into Parallel Sequence Decoding with Adapters

Junliang Guo; Zhirui Zhang; Linli Xu; Hao-Ran Wei; Boxing Chen; Enhong Chen

arXiv:2010.06138 · cs.CL · October 14, 2020 · 40 cites

TL;DR

This paper introduces a flexible, efficient sequence decoding framework that integrates BERT models with adapters for improved neural machine translation, reducing latency and maintaining high translation quality.

Contribution

The paper proposes a parallel decoding approach that pairs BERT with adapters. The adapters act as task-agnostic plug-in modules, and the resulting model outperforms autoregressive baselines on translation tasks.

Findings

01 Outperforms autoregressive baselines in BLEU scores

02 Reduces inference latency by half

03 Achieves state-of-the-art translation quality

Abstract

While large scale pre-trained language models such as BERT have achieved great success on various natural language understanding tasks, how to efficiently and effectively incorporate them into sequence-to-sequence models and the corresponding text generation tasks remains a non-trivial problem. In this paper, we propose to address this problem by taking two different BERT models as the encoder and decoder respectively, and fine-tuning them by introducing simple and lightweight adapter modules, which are inserted between BERT layers and tuned on the task-specific dataset. In this way, we obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models, while bypassing the catastrophic forgetting problem. Each component in the framework can be considered as a plug-in unit, making the framework flexible and…
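The idea of inserting small trainable modules between frozen pre-trained layers can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the truncated abstract does not specify the adapter's internals, so the bottleneck shape (down-projection, ReLU, up-projection, residual connection), the zero-initialised up-projection, and the names `BottleneckAdapter`, `frozen_layer`, and `run_stack` are all assumptions made for the example.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class BottleneckAdapter:
    """Hypothetical adapter: down-project, ReLU, up-project, residual add."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        self.down_w = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                       for _ in range(bottleneck_size)]
        self.down_b = [0.0] * bottleneck_size
        # Assumption: a zero-initialised up-projection, so the adapter starts
        # as the identity and the frozen layer's output passes through as-is.
        self.up_w = [[0.0] * bottleneck_size for _ in range(hidden_size)]
        self.up_b = [0.0] * hidden_size

    def __call__(self, hidden):
        z = [max(0.0, v) for v in linear(hidden, self.down_w, self.down_b)]
        delta = linear(z, self.up_w, self.up_b)
        return [h + d for h, d in zip(hidden, delta)]

    def num_parameters(self):
        return (sum(len(row) for row in self.down_w) + len(self.down_b)
                + sum(len(row) for row in self.up_w) + len(self.up_b))


def frozen_layer(hidden):
    """Stand-in for a frozen pre-trained layer (a fixed toy function here)."""
    return [2.0 * h + 1.0 for h in hidden]


def run_stack(hidden, adapters):
    """Alternate frozen layers with adapters; only adapters would be tuned."""
    for adapter in adapters:
        hidden = adapter(frozen_layer(hidden))
    return hidden
```

In this sketch only the adapter weights would be updated during fine-tuning, while `frozen_layer` stays fixed, which is one way to read the abstract's point about avoiding catastrophic forgetting. The adapter adds `2 * hidden_size * bottleneck_size + hidden_size + bottleneck_size` parameters per layer, which is small when the bottleneck is much narrower than the hidden size.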

Code & Models

Repositories

lemmonation/abnet
PyTorch · Official

Videos

Incorporating BERT into Parallel Sequence Decoding with Adapters · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Adam · Softmax · Layer Normalization · Dense Connections · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout