TL;DR
This paper presents a novel sub-character architecture for Korean language processing that decomposes characters into phonetic units called jamo, improving accuracy and reducing data sparsity in dependency parsing.
Contribution
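The paper introduces a sub-character architecture that decomposes each Korean character into jamo letters and induces character- and word-level representations from them, exposing syntactic and semantic information that is difficult to access with conventional character-level units.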
Findings
Reduced observation space to 1.6% of original
Dramatic accuracy improvement in dependency parsing over strong lexical baselines
Effective alleviation of data sparsity issues
Abstract
We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.
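The decomposition step described in the abstract can be illustrated with standard Unicode arithmetic. The sketch below is not the authors' code: it only splits precomposed Hangul syllables into conjoining jamo and does not cover how character- and word-level representations are induced from them. The function names `decompose_syllable` and `decompose_word` are hypothetical, chosen for illustration.

```python
import unicodedata

# Unicode lays out the 11,172 precomposed Hangul syllables as
# 19 leading consonants x 21 vowels x 28 tails (tail 0 = no tail).
HANGUL_BASE = 0xAC00
NUM_VOWELS = 21
NUM_TAILS = 28
NUM_SYLLABLES = 19 * NUM_VOWELS * NUM_TAILS

LEAD_BASE = 0x1100   # first conjoining leading consonant
VOWEL_BASE = 0x1161  # first conjoining vowel
TAIL_BASE = 0x11A7   # tail index t (t >= 1) maps to 0x11A7 + t


def decompose_syllable(ch):
    """Split one Hangul syllable into its jamo; return other characters unchanged."""
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < NUM_SYLLABLES:
        return [ch]
    lead, rest = divmod(index, NUM_VOWELS * NUM_TAILS)
    vowel, tail = divmod(rest, NUM_TAILS)
    jamo = [chr(LEAD_BASE + lead), chr(VOWEL_BASE + vowel)]
    if tail:  # the tail consonant is optional
        jamo.append(chr(TAIL_BASE + tail))
    return jamo


def decompose_word(word):
    """Return one jamo list per character, preserving character boundaries."""
    return [decompose_syllable(ch) for ch in word]


# Cross-check against Unicode canonical decomposition (NFD).
for syllable in "한국어":
    assert "".join(decompose_syllable(syllable)) == unicodedata.normalize("NFD", syllable)
```

Keeping one jamo list per character, rather than flattening the word into a single sequence, preserves the character boundaries that a model building character-level representations from jamo would need.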
