Character-level Transformer-based Neural Machine Translation
Nikolay Banar, Walter Daelemans, Mike Kestemont

TL;DR
This paper introduces a novel character-level Transformer-based neural machine translation model that is faster to train than the character-level Transformer and achieves comparable or better translation quality than existing character- and subword-level models across multiple language pairs.
Contribution
The paper presents a new Transformer architecture for character-level NMT that is more efficient and competitive with subword models, and provides comprehensive evaluation and open-source code.
Findings
The proposed model is 34% faster to train than previous character-level Transformers.
It outperforms subword-level models in FI-EN translation.
Its translation quality is at least on par with the character-level Transformer.
Abstract
Neural machine translation (NMT) is nowadays commonly applied at the subword level, using byte-pair encoding. A promising alternative approach focuses on character-level translation, which simplifies processing pipelines in NMT considerably. This approach, however, must consider relatively longer sequences, rendering the training process prohibitively expensive. In this paper, we discuss a novel Transformer-based approach that we compare, both in speed and in quality, to the Transformer at subword and character levels, as well as to previously developed character-level models. We evaluate our models on 4 language pairs from WMT'15: DE-EN, CS-EN, FI-EN and RU-EN. The proposed novel architecture can be trained on a single GPU and is 34% faster than the character-level Transformer; still, the obtained results are at least on par with it. In addition, our proposed model outperforms…
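To make the sequence-length point in the abstract concrete, here is a minimal sketch, not taken from the paper. It compares the length of one sentence when split into characters versus whitespace-separated words, and counts the query-key pairs that full self-attention would compute over each. The helper names (`char_tokens`, `word_tokens`, `attention_pairs`) are hypothetical, and whitespace splitting is only a stand-in for byte-pair encoding, which needs a learned merge table and would usually give lengths somewhere between the two.

```python
def char_tokens(sentence: str) -> list[str]:
    """Split a sentence into single characters, keeping spaces as tokens."""
    return list(sentence)


def word_tokens(sentence: str) -> list[str]:
    """Split on whitespace; a rough stand-in for subword (BPE) segmentation."""
    return sentence.split()


def attention_pairs(n: int) -> int:
    """Query-key pairs computed by full self-attention over n tokens."""
    return n * n


sentence = "character level translation"
chars = char_tokens(sentence)
words = word_tokens(sentence)

print("character tokens:", len(chars))
print("word tokens:", len(words))
print("attention pairs (chars):", attention_pairs(len(chars)))
print("attention pairs (words):", attention_pairs(len(words)))
```

Because the number of attention pairs grows with the square of the sequence length, even a moderate increase in length from character-level input raises the cost of each attention layer considerably, which is the training expense the abstract refers to.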
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout
