TL;DR
SCRIPT is a module that injects subcharacter compositional knowledge into Korean language models, improving their handling of morphological and phonological structure without changing the model architecture.
Contribution
The paper introduces a model-agnostic module that enriches Korean pre-trained language models (PLMs) with subcharacter structural information, improving both their linguistic behavior and their downstream task performance.
Findings
Enhances Korean PLMs across NLU and NLG tasks.
Reshapes embedding space to better capture grammatical regularities.
Achieves performance gains without architectural changes.
Abstract
Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT enhances subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT enhances all baselines across various Korean natural language understanding (NLU) and…
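To make the notion of Jamo composition concrete, the following is a minimal sketch of how a precomposed Hangul syllable splits into its subcharacter units. It uses only the standard Unicode Hangul decomposition arithmetic and is not taken from the paper: it illustrates the writing-system property the abstract describes, not SCRIPT's own method. The function names `decompose_syllable` and `to_jamo` are hypothetical, chosen for illustration.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11,172 precomposed syllables


def decompose_syllable(ch):
    """Split one precomposed Hangul syllable into conjoining Jamo.

    Hypothetical helper for illustration; characters outside the
    Hangul Syllables block are returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [ch]
    lead = L_BASE + s_index // (V_COUNT * T_COUNT)                # initial consonant
    vowel = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT   # medial vowel
    tail = s_index % T_COUNT                                      # 0 means no final consonant
    jamo = [chr(lead), chr(vowel)]
    if tail:
        jamo.append(chr(T_BASE + tail))
    return jamo


def to_jamo(text):
    """Flatten a string into its sequence of Jamo (hypothetical helper)."""
    return [j for ch in text for j in decompose_syllable(ch)]


if __name__ == "__main__":
    # Print each syllable next to the code points of its Jamo.
    for syllable in "한글":
        parts = decompose_syllable(syllable)
        print(syllable, [f"U+{ord(p):04X}" for p in parts])
    # The arithmetic agrees with Unicode canonical decomposition (NFD).
    print("".join(to_jamo("한글")) == unicodedata.normalize("NFD", "한글"))
```

A subword tokenizer typically treats a syllable such as 한 as an opaque symbol, or as part of a larger opaque token. The decomposition above exposes the initial consonant, vowel, and optional final consonant that a subcharacter-aware module could draw on.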
