Investigating an Effective Character-level Embedding in Korean Sentence Classification
Won Ik Cho, Seok Min Kim, Nam Soo Kim

TL;DR
This paper compares several character encoding schemes for Korean sentence classification. It finds that character-level features generally outperform Jamo-level features, although Jamo-level features may help attention-based models that have enough parameters.
Contribution
It systematically evaluates different Korean character encoding schemes for classification tasks, highlighting the effectiveness of character-level features over Jamo-level features.
Findings
Character-level features outperform Jamo-level features in classification accuracy.
Jamo-level features may benefit attention-based models with sufficient parameters.
The character-rich, agglutinative nature of Korean influences how effective each encoding scheme is.
Abstract
Different from the writing systems of many Romance and Germanic languages, some languages or language families show complex conjunct forms in character composition. For such cases where the conjuncts consist of the components representing consonant(s) and vowel, various character encoding schemes can be adopted beyond merely making up a one-hot vector. However, there has been little work done on intra-language comparison regarding performances using each representation. In this study, utilizing the Korean language which is character-rich and agglutinative, we investigate an encoding scheme that is the most effective among Jamo-level one-hot, character-level one-hot, character-level dense, and character-level multi-hot. Classification performance with each scheme is evaluated on two corpora: one on binary sentiment analysis of movie reviews, and the other on multi-class identification of…
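As a rough illustration of how a Jamo-level encoding differs from a character-level multi-hot encoding, the sketch below splits a precomposed Hangul syllable into its initial, medial, and final Jamo indices using the standard Unicode arithmetic (19 initials, 21 medials, 28 finals including "no final"). The function names, the 68-dimensional index layout, and the choice to give "no final" its own slot are assumptions made for this example; they are not taken from the paper and may differ from the dimensions the authors used.

```python
# Minimal sketch (assumed layout, not the paper's implementation).
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable, "가"
NUM_INITIAL, NUM_MEDIAL, NUM_FINAL = 19, 21, 28  # 28 includes "no final"
DIM = NUM_INITIAL + NUM_MEDIAL + NUM_FINAL  # 68 slots in this sketch


def decompose(syllable):
    """Return (initial, medial, final) Jamo indices of one Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_INITIAL * NUM_MEDIAL * NUM_FINAL:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def multi_hot(syllable):
    """Character-level multi-hot: one vector per syllable, three bits set."""
    initial, medial, final = decompose(syllable)
    vec = [0] * DIM
    vec[initial] = 1
    vec[NUM_INITIAL + medial] = 1
    vec[NUM_INITIAL + NUM_MEDIAL + final] = 1  # final == 0 means "no final"
    return vec


def jamo_one_hot(syllable):
    """Jamo-level one-hot: one vector per Jamo that is actually present."""
    initial, medial, final = decompose(syllable)
    positions = [initial, NUM_INITIAL + medial]
    if final != 0:  # skip the empty final so the sequence has 2 or 3 steps
        positions.append(NUM_INITIAL + NUM_MEDIAL + final)
    vectors = []
    for pos in positions:
        vec = [0] * DIM
        vec[pos] = 1
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    print(decompose("한"))          # indices of the three Jamo in "한"
    print(len(jamo_one_hot("한")))  # number of Jamo-level time steps
    print(sum(multi_hot("한")))     # number of active bits in the multi-hot vector
```

Under these assumptions, the Jamo-level scheme turns each syllable into two or three time steps, while the multi-hot scheme keeps one time step per syllable and packs the same information into a single sparse vector. A character-level one-hot or dense scheme would instead assign each whole syllable its own index or learned vector.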
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques · Topic Modeling
