A Sub-Character Architecture for Korean Language Processing

Karl Stratos

arXiv:1707.06341·cs.CL·July 24, 2017

TL;DR

This paper presents a sub-character architecture for Korean language processing that decomposes each character into primitive phonetic units called jamo letters. Doing so reduces the observation space to 1.6% of the original, alleviates data sparsity, and improves dependency parsing accuracy.

Contribution

The paper introduces a sub-character architecture that induces character- and word-level representations from jamo letters, exposing syntactic and semantic information that is difficult to access with conventional character-level units.

Findings

01. Observation space reduced to 1.6% of the original

02. Significant accuracy improvements in dependency parsing

03. Effective alleviation of data sparsity issues

Abstract

We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.
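
The decomposition step described in the abstract can be illustrated with standard Unicode arithmetic on precomposed Hangul syllables. The following is a minimal sketch, not the paper's implementation: the function names `decompose` and `decompose_word` are hypothetical, and the choice to return conjoining jamo code points (and to omit the tail when a syllable has none) is an assumption made for illustration.

```python
import unicodedata

# Unicode Hangul Syllables block: U+AC00 onward, 19 leads x 21 vowels x 28 tails.
HANGUL_BASE = 0xAC00
NUM_VOWELS = 21
NUM_TAILS = 28  # includes the "no tail consonant" slot at index 0
NUM_SYLLABLES = 19 * NUM_VOWELS * NUM_TAILS  # 11,172 precomposed syllables


def decompose(ch):
    """Split one precomposed Hangul syllable into its conjoining jamo.

    Characters outside the Hangul Syllables block are returned unchanged.
    (Hypothetical helper for illustration only.)
    """
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < NUM_SYLLABLES:
        return (ch,)
    lead, rest = divmod(index, NUM_VOWELS * NUM_TAILS)
    vowel, tail = divmod(rest, NUM_TAILS)
    jamo = [chr(0x1100 + lead), chr(0x1161 + vowel)]
    if tail:  # tail index 0 means the syllable has no final consonant
        jamo.append(chr(0x11A7 + tail))
    return tuple(jamo)


def decompose_word(word):
    """Decompose every character of a word, keeping character boundaries."""
    return [decompose(ch) for ch in word]


if __name__ == "__main__":
    for ch, jamo in zip("한국어", decompose_word("한국어")):
        print(ch, [f"U+{ord(j):04X}" for j in jamo])
```

For a single syllable, the result should agree with Unicode canonical decomposition, so `unicodedata.normalize("NFD", ch)` is a convenient cross-check. Keeping one tuple per character preserves the boundaries needed to build character-level representations from the jamo before composing them into word-level ones.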

Code & Models

Repositories

karlstratos/koreannet
Official
