TL;DR
KOMBO is a framework for Korean pre-trained language models (PLMs) that builds character representations from the invention principles of Hangeul, working at the subcharacter level. It outperforms the state-of-the-art Korean PLM by an average of 2.11% on five Korean natural language understanding tasks.
Contribution
The paper presents a new framework for Korean PLMs that encodes Hangeul's invention principles, emphasizing subcharacter features over subword units.
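To make "subcharacter" concrete, the sketch below splits a precomposed Hangeul syllable into its initial consonant, vowel, and optional final consonant, using the standard Unicode arithmetic for the Hangul Syllables block. It only illustrates what the units below the character level look like; it is not KOMBO's encoding method, and the function name `decompose_syllable` is a hypothetical helper introduced here.

```python
# Illustrative sketch only: standard Unicode decomposition of a Hangeul
# syllable into conjoining jamo. This is NOT the KOMBO architecture.

S_BASE = 0xAC00                          # first precomposed syllable, "가"
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # initials, vowels, finals (incl. "no final")

CHOSEONG = [chr(0x1100 + i) for i in range(L_COUNT)]              # initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(V_COUNT)]             # vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(T_COUNT - 1)]  # finals; index 0 = none


def decompose_syllable(ch):
    """Return the jamo of one Hangeul syllable as a tuple.

    Characters outside the Hangul Syllables block are returned unchanged
    as a one-element tuple.
    """
    index = ord(ch) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        return (ch,)
    l, rest = divmod(index, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    parts = (CHOSEONG[l], JUNGSEONG[v], JONGSEONG[t])
    return tuple(p for p in parts if p)  # drop the empty "no final" slot


if __name__ == "__main__":
    for syllable in "한글":
        print(syllable, decompose_syllable(syllable))
```

A subword tokenizer would treat "한" as one opaque symbol (or part of a larger one), whereas a subcharacter view exposes three components that recur across many syllables.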
Findings
Outperforms the state-of-the-art Korean PLM by an average of 2.11% across five Korean natural language understanding tasks.
Demonstrates the effectiveness of subcharacter-based representations for Korean.
Shows improved understanding of Korean linguistic features.
Abstract
The Korean writing system, Hangeul, has a unique character representation that rigidly follows the invention principles recorded in Hunminjeongeum. (Hunminjeongeum is a book published in 1446 that describes the principles of invention and usage of Hangeul, devised by King Sejong [Hunminjeongeum_Guide].) However, existing pre-trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which is the first to bring the invention principles of Hangeul to character representation. KOMBO exhibits notable experimental proficiency across diverse NLP tasks. In particular, our method outperforms the state-of-the-art Korean PLM by an average of 2.11% in five Korean natural language understanding tasks. Furthermore, extensive experiments…
