An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization

Gongbo Tang; Fabienne Cap; Eva Pettersson; Joakim Nivre

arXiv:1806.05210 · cs.CL · August 7, 2018 · 27 cites

TL;DR

This paper evaluates various neural machine translation models for historical spelling normalization across five languages, demonstrating that NMT models outperform SMT, with specific architectures excelling under different data conditions.

Contribution

The study systematically compares multiple NMT architectures and attention mechanisms for spelling normalization, introducing a hybrid method that enhances performance.

Findings

01  NMT models outperform SMT in character error rate.

02  Transformer models need more data to outperform RNNs.

03  Subword models with small vocabularies are better for low-resource languages.

Abstract

In this paper, we apply different NMT models to the problem of historical spelling normalization for five languages: English, German, Hungarian, Icelandic, and Swedish. The NMT models operate at different levels (character and subword) and use different attention mechanisms and different neural network architectures. Our results show that NMT models are much better than SMT models in terms of character error rate. Vanilla RNNs are competitive with GRUs/LSTMs in historical spelling normalization. Transformer models perform better only when provided with more training data. We also find that subword-level models with a small subword vocabulary are better than character-level models for low-resource languages. In addition, we propose a hybrid method which further improves the performance of historical spelling normalization.
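The abstract reports results in terms of character error rate (CER) but this page does not give the exact formula. As a rough illustration only, the sketch below assumes the common definition: total Levenshtein edit distance between system outputs and gold normalizations, divided by the total number of gold characters. The function names `edit_distance` and `character_error_rate` and the example word pair are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (unit cost for
    insertion, deletion, and substitution)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(hypotheses: Sequence[str],
                         references: Sequence[str]) -> float:
    """Assumed CER definition: summed edit distance over summed
    reference length. The paper may normalize differently."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must align one-to-one")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(h, r)
                      for h, r in zip(hypotheses, references))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical example: an unnormalized historical spelling scored
    # against its modern form.
    print(character_error_rate(["vppon"], ["upon"]))
```

Under this assumed definition, a system that copies the historical spelling unchanged gets a nonzero CER whenever the spelling differs from the modern form, which is why CER is a natural way to compare normalization systems.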

Code & Models

Repositories

tanggongbo/normalization-NMT
Official

Datasets

Kylan12/Synthetic-AI-ML-Dataset
dataset · 42 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax