TL;DR
This paper enables character-level Chinese-English neural machine translation by encoding Chinese characters with the Wubi scheme, which breaks each character into smaller units and so narrows the gap between the two writing systems.
Contribution
It proposes using Wubi encoding to adapt character-level NMT models to Chinese, bridging the gap between the Chinese and English writing systems.
Findings
Wubi encoding preserves the shape and semantic information of Chinese characters and is reversible.
Wubi-based models show promising results at both the character and subword levels.
These results hold for both recurrent and convolutional models.
Abstract
Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They mainly do well for Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two different writing systems poses a major challenge because of a lack of systematic correspondence between the individual linguistic units. In this paper, we enable character-level NMT for Chinese by breaking down Chinese characters into linguistic units similar to those of Indo-European languages. We use the Wubi encoding scheme, which preserves the original shape and semantic information of the characters, while also being reversible. We show promising results from training Wubi-based models on the character and subword levels with recurrent as well as convolutional models.
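To make the encoding step concrete, here is a minimal sketch of a reversible Wubi-style transform. It is not the authors' implementation: the three-entry table, the function names, and the space separator are all assumptions chosen for illustration, and the codes should be checked against a full Wubi table before any real use.

```python
# Hypothetical toy table mapping Chinese characters to Wubi-style letter codes.
# A real system would load a complete table; these entries are illustrative only.
WUBI_TABLE = {"我": "trnt", "你": "wqiy", "好": "vbg"}

# Inverse table for decoding; this assumes every code in the table is unique.
INVERSE_TABLE = {code: char for char, code in WUBI_TABLE.items()}


def encode(text, sep=" "):
    """Replace each known character with its code; pass unknown characters through."""
    return sep.join(WUBI_TABLE.get(char, char) for char in text)


def decode(encoded, sep=" "):
    """Map each code back to its character; pass unknown tokens through."""
    return "".join(INVERSE_TABLE.get(token, token) for token in encoded.split(sep))


encoded = encode("你好")
# A character-level model would then consume the encoded string one symbol at a time.
char_tokens = list(encoded)
restored = decode(encoded)
```

In this sketch the round trip works only because the toy table is one-to-one and the input contains no spaces; how to handle characters that share a code, or text that already contains the separator, is left open here.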
