Character-level Transformer-based Neural Machine Translation

Nikolay Banar; Walter Daelemans; Mike Kestemont

arXiv:2005.11239·cs.CL·May 25, 2020

TL;DR

This paper introduces a character-level Transformer-based neural machine translation model that trains 34% faster than the character-level Transformer while matching its translation quality, evaluated on four WMT'15 language pairs (DE-EN, CS-EN, FI-EN and RU-EN).

Contribution

The paper presents a new Transformer architecture for character-level NMT that is more efficient and competitive with subword models, and provides comprehensive evaluation and open-source code.

Findings

01

The proposed model trains 34% faster than the character-level Transformer and fits on a single GPU.

02

It outperforms subword-level models in FI-EN translation.

03

Its translation quality is at least on par with the character-level Transformer.

Abstract

Neural machine translation (NMT) is nowadays commonly applied at the subword level, using byte-pair encoding. A promising alternative approach focuses on character-level translation, which simplifies processing pipelines in NMT considerably. This approach, however, must consider relatively longer sequences, rendering the training process prohibitively expensive. In this paper, we discuss a novel, Transformer-based approach that we compare, both in speed and in quality, to the Transformer at subword and character levels, as well as to previously developed character-level models. We evaluate our models on 4 language pairs from WMT'15: DE-EN, CS-EN, FI-EN and RU-EN. The proposed novel architecture can be trained on a single GPU and is 34% faster than the character-level Transformer; still, the obtained results are at least on par with it. In addition, our proposed model outperforms…
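The cost argument in the abstract (longer sequences make training expensive) can be illustrated with a rough sketch. The snippet below is not taken from the paper. It assumes whitespace-separated words as a crude stand-in for subword units (a real byte-pair encoding would give a length somewhere between the two), and it assumes self-attention cost grows with the square of the sequence length. The example sentence and the function names are hypothetical.

```python
def sequence_lengths(sentence: str) -> dict:
    """Return the sequence length under two segmentations of the same text."""
    return {
        # Character level: every character, including spaces, is one token.
        "characters": len(sentence),
        # Assumption: whitespace words approximate a coarse subword segmentation.
        "words": len(sentence.split()),
    }


def attention_pairs(n: int) -> int:
    """Number of query-key pairs a full self-attention layer scores for length n."""
    return n * n


sentence = "Neural machine translation is commonly applied at the subword level."
lengths = sequence_lengths(sentence)

# How many times more attention pairs the character-level input requires.
ratio = attention_pairs(lengths["characters"]) / attention_pairs(lengths["words"])

print(lengths)
print(f"attention cost ratio (characters vs. words): {ratio:.1f}x")
```

Because the cost is quadratic, a sequence that is several times longer produces a much larger increase in attention work, which is the overhead a faster character-level architecture has to offset.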

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout