Integrated Training for Sequence-to-Sequence Models Using Non-Autoregressive Transformer
Evgeniia Tokarchuk, Jan Rosendahl, Weiyue Wang, Pavel Petrushkov, Tomer Lancewicki, Shahram Khadivi, Hermann Ney

TL;DR
This paper introduces a non-autoregressive Transformer-based cascaded model that enables end-to-end training for sequence-to-sequence tasks, reducing error propagation and improving translation quality.
Contribution
It proposes a novel architecture that allows end-to-end training without explicit intermediate representations, addressing key issues in traditional cascaded models.
Findings
Over 2 BLEU improvement on French-German translation
Effective end-to-end training without intermediate data
Reduces error propagation in cascaded models
Abstract
Complex natural language applications such as speech translation or pivot translation traditionally rely on cascaded models. However, cascaded models are known to be prone to error propagation and model discrepancy problems. Furthermore, there is no possibility of using end-to-end training data in conventional cascaded systems, meaning that the training data most suited for the task cannot be used. Previous studies suggested several approaches for integrated end-to-end training to overcome those problems; however, they mostly rely on (synthetic or natural) three-way data. We propose a cascaded model based on the non-autoregressive Transformer that enables end-to-end training without the need for an explicit intermediate representation. This new architecture (i) avoids unnecessary early decisions that can cause errors which are then propagated throughout the cascaded models and (ii)…
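The abstract does not spell out how the two stages are connected, so the following is only a minimal sketch of one plausible reading of "without an explicit intermediate representation". It assumes the first stage emits vocabulary logits for all intermediate positions at once (as a non-autoregressive model can), and compares a conventional hard interface (commit to the argmax token) with a hypothetical soft interface (pass the probability-weighted mix of embeddings, which stays differentiable). The function names `hard_interface` and `soft_interface`, the toy embedding table, and the logits are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_interface(logits_per_pos, embeddings):
    """Conventional cascade: commit to the single best token per position.

    The argmax is an early decision; any mistake made here is passed on
    unchanged to the second model, and no gradient can flow through it.
    """
    out = []
    for logits in logits_per_pos:
        token = max(range(len(logits)), key=logits.__getitem__)
        out.append(list(embeddings[token]))
    return out


def soft_interface(logits_per_pos, embeddings):
    """Hypothetical integrated interface (an assumption for illustration).

    Each position is represented by the expected embedding under the
    first model's output distribution, so no token is ever selected.
    """
    dim = len(embeddings[0])
    out = []
    for logits in logits_per_pos:
        probs = softmax(logits)
        mixed = [
            sum(p * embeddings[v][d] for v, p in enumerate(probs))
            for d in range(dim)
        ]
        out.append(mixed)
    return out


# Toy setup: 3 intermediate-vocabulary tokens with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Position 0 is nearly a tie between tokens 0 and 1; position 1 is confident.
logits = [[2.0, 1.9, -5.0], [0.0, 4.0, 0.0]]

hard = hard_interface(logits, embeddings)
soft = soft_interface(logits, embeddings)
print("hard:", hard)
print("soft:", soft)
```

At the near-tie position the hard interface discards the runner-up token entirely, while the soft interface keeps almost half of its embedding in the mix. That is the sense in which a soft connection could avoid an early decision; whether the paper uses this exact mechanism is not stated in the truncated abstract.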
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Dense Connections · Byte Pair Encoding · Label Smoothing
