On the Use of Machine Translation-Based Approaches for Vietnamese Diacritic Restoration

Thai-Hoang Pham; Xuan-Khoai Pham; Phuong Le-Hong

arXiv:1709.07104·cs.CL·October 27, 2017

TL;DR

This study compares phrase-based and neural machine translation methods for Vietnamese diacritic restoration. The neural method is about twice as fast at inference but slightly less accurate (96.15% vs. 97.32%), and the authors see room to improve it with pre-trained word embeddings and more training data.

Contribution

The paper is the first to apply neural machine translation to Vietnamese diacritic restoration, and it gives a thorough comparison with the phrase-based approach, the state of the art for this problem at the time of writing.

Findings

01. The phrase-based approach achieves 97.32% accuracy.

02. The neural-based approach achieves 96.15% accuracy.

03. The neural method is approximately twice as fast as the phrase-based method at inference.

Abstract

This paper presents an empirical study of two machine translation-based approaches to the Vietnamese diacritic restoration problem: phrase-based and neural-based machine translation models. This is the first work that applies a neural-based machine translation method to this problem and gives a thorough comparison with the phrase-based machine translation method, which is the current state-of-the-art method for this problem. On a large dataset, the phrase-based approach has an accuracy of 97.32%, while that of the neural-based approach is 96.15%. Although the neural-based method has slightly lower accuracy, it is about twice as fast as the phrase-based method in terms of inference speed. Moreover, the neural-based machine translation method has much room for future improvement, such as incorporating pre-trained word embeddings and collecting more training data.
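
To make the translation framing concrete, the sketch below shows one plausible way to build a (source, target) training pair by stripping diacritics from accented Vietnamese text, plus a simple token-level accuracy score. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not describe how the parallel data was built or how accuracy was computed, and the function names `strip_diacritics`, `make_pair`, and `token_accuracy` are hypothetical.

```python
import unicodedata
from typing import Tuple


def strip_diacritics(text: str) -> str:
    """Return text with Vietnamese diacritics removed (assumed preprocessing)."""
    # The letters đ/Đ have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (tone marks, circumflex, breve, horn).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence: str) -> Tuple[str, str]:
    """Build one (source, target) pair: unaccented input, accented output."""
    return strip_diacritics(sentence), sentence


def token_accuracy(predicted: str, reference: str) -> float:
    """Fraction of whitespace-separated tokens restored exactly.

    Assumption: accuracy is scored per token; the abstract does not
    specify the unit used in the paper.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if len(pred_tokens) != len(ref_tokens):
        raise ValueError("prediction and reference must have the same token count")
    if not ref_tokens:
        return 1.0
    correct = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return correct / len(ref_tokens)


if __name__ == "__main__":
    source, target = make_pair("Tiếng Việt")
    print(source, "->", target)
    print(token_accuracy("tôi di học", "tôi đi học"))
```

Because restoring diacritics never changes the number of tokens, the source and target sides stay aligned one to one, which is why a per-token score is a natural (though here assumed) way to evaluate a system's output.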
