Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling
Zhijun Wang, Xuebo Liu, Min Zhang

TL;DR
This paper introduces StrokeNet, a novel Chinese character representation using stroke sequences, which enhances neural machine translation by capturing internal features and reducing parameters, leading to improved translation performance.
Contribution
StrokeNet is a new representation method that maps Chinese strokes to Latin characters, enabling better feature sharing and parameter efficiency in NMT models.
Findings
Significant BLEU score improvements on Chinese-English translation tasks.
Fewer model parameters needed for comparable or better performance.
Effective utilization of stroke-based representations in multilingual NMT.
Abstract
Existing research generally treats Chinese character as a minimum unit for representation. However, such Chinese character representation will suffer two bottlenecks: 1) Learning bottleneck, the learning cannot benefit from its rich internal features (e.g., radicals and strokes); and 2) Parameter bottleneck, each individual character has to be represented by a unique vector. In this paper, we introduce a novel representation method for Chinese characters to break the bottlenecks, namely StrokeNet, which represents a Chinese character by a Latinized stroke sequence (e.g., "ao1 (concave)" to "ajaie" and "tu1 (convex)" to "aeaqe"). Specifically, StrokeNet maps each stroke to a specific Latin character, thus allowing similar Chinese characters to have similar Latin representations. With the introduction of StrokeNet to neural machine translation (NMT), many powerful but not applicable…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling
