Integrated Training for Sequence-to-Sequence Models Using Non-Autoregressive Transformer

Evgeniia Tokarchuk; Jan Rosendahl; Weiyue Wang; Pavel Petrushkov; Tomer Lancewicki; Shahram Khadivi; Hermann Ney

arXiv:2109.12950 · cs.CL · September 28, 2021

TL;DR

This paper introduces a non-autoregressive Transformer-based cascaded model that enables end-to-end training for sequence-to-sequence tasks, reducing error propagation and improving translation quality.

Contribution

It proposes a novel architecture that allows end-to-end training without explicit intermediate representations, addressing key issues in traditional cascaded models.

Findings

01. Over 2 BLEU improvement on French-German translation

02. Effective end-to-end training without intermediate data

03. Reduces error propagation in cascaded models

Abstract

Complex natural language applications such as speech translation or pivot translation traditionally rely on cascaded models. However, cascaded models are known to be prone to error propagation and model discrepancy problems. Furthermore, there is no possibility of using end-to-end training data in conventional cascaded systems, meaning that the training data most suited for the task cannot be used. Previous studies suggested several approaches for integrated end-to-end training to overcome those problems; however, they mostly rely on (synthetic or natural) three-way data. We propose a cascaded model based on the non-autoregressive Transformer that enables end-to-end training without the need for an explicit intermediate representation. This new architecture (i) avoids unnecessary early decisions that can cause errors which are then propagated throughout the cascaded models and (ii)…
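To make the phrase "unnecessary early decisions" concrete, the following is a minimal sketch of the general idea, not the paper's implementation. It assumes the first model emits logits over a pivot vocabulary at each position and that the second model consumes embeddings. The function names (`hard_pivot_input`, `soft_pivot_input`), the toy embedding table, and the logits are all hypothetical. A conventional cascade commits to one pivot token with an argmax; the soft variant instead passes the expected embedding under the first model's distribution, so a close runner-up token still influences the second model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_pivot_input(logits, embedding):
    """Conventional cascade: pick the single best pivot token, then embed it."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return list(embedding[best])


def soft_pivot_input(logits, embedding):
    """Hypothetical soft hand-off: probability-weighted average of embeddings."""
    probs = softmax(logits)
    dim = len(embedding[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding)) for d in range(dim)]


# Toy pivot vocabulary of 3 tokens with 2-dimensional embeddings (assumed values).
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Tokens 0 and 1 are nearly tied at this position; token 2 is unlikely.
logits = [2.0, 1.9, -3.0]

hard = hard_pivot_input(logits, embedding)  # keeps only token 0's embedding
soft = soft_pivot_input(logits, embedding)  # blends tokens 0 and 1

print("hard:", hard)
print("soft:", [round(v, 3) for v in soft])
```

In the hard variant, the near-tie between tokens 0 and 1 is discarded, and a wrong choice would propagate to the second model. In the soft variant, the hand-off is a weighted sum, which is also differentiable with respect to the logits; that property is what would let a training signal reach the first model in an end-to-end setup under these assumptions.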

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Dense Connections · Byte Pair Encoding · Label Smoothing