Inference-only sub-character decomposition improves translation of unseen logographic characters
Danielle Saunders, Weston Feely, Bill Byrne

TL;DR
This paper proposes an inference-only sub-character decomposition method that improves neural machine translation of logographic characters unseen in training, without retraining the model.
Contribution
It introduces a simple, inference-only approach for sub-character decomposition that improves unseen character translation without retraining models.
Findings
Inference-only decomposition outperforms complete sub-character decomposition, which requires retraining and often harms unseen character translation.
The approach improves translation adequacy for unseen characters.
It is effective for both Chinese-to-English and Japanese-to-English translation, in high-resource and low-resource domains.
Abstract
Neural Machine Translation (NMT) on logographic source languages struggles when translating 'unseen' characters, which never appear in the training data. One possible approach to this problem uses sub-character decomposition for training and test sentences. However, this approach involves complete retraining, and its effectiveness for unseen character translation to non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, for both high-resource and low-resource domains. For each language pair and domain we construct a test set where all source sentences contain at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally. We offer a simple…
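The test-set construction described in the abstract, and the general idea of decomposing only unseen characters at inference time, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the restriction to the basic CJK Unified Ideographs block, the toy training data, and the one-entry decomposition table written in Ideographic Description Sequence style. The paper's actual decomposition scheme and preprocessing may differ.

```python
def is_logographic(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block counts as
    # logographic here; extension blocks are ignored for brevity.
    return "\u4e00" <= ch <= "\u9fff"


def training_characters(train_sentences):
    # Every logographic character observed in the training source side.
    return {ch for s in train_sentences for ch in s if is_logographic(ch)}


def unseen_characters(sentence, seen):
    # Logographic characters in the sentence that never occur in training.
    return {ch for ch in sentence if is_logographic(ch) and ch not in seen}


def build_unseen_test_set(candidates, seen):
    # Keep only sentences containing at least one unseen logographic character.
    return [s for s in candidates if unseen_characters(s, seen)]


def decompose_unseen(sentence, seen, table):
    # Replace only unseen characters that have an entry in the table;
    # seen characters and characters without an entry pass through unchanged.
    return "".join(
        table.get(ch, ch) if is_logographic(ch) and ch not in seen else ch
        for ch in sentence
    )


# Hypothetical toy data, not from the paper.
TOY_TRAIN = ["魚を食べる", "東京に行く"]
TOY_CANDIDATES = ["鯨を見る", "魚を食べる"]
TOY_DECOMPOSITIONS = {"鯨": "⿰魚京"}  # illustrative single entry

seen = training_characters(TOY_TRAIN)
test_set = build_unseen_test_set(TOY_CANDIDATES, seen)
decomposed = [decompose_unseen(s, seen, TOY_DECOMPOSITIONS) for s in test_set]
```

In this toy setup the training data and the model are left untouched; only test sentences are altered, and only at the positions of characters the model has never seen.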
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
