HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training
Seungho Choi

TL;DR
HanjaBridge improves Korean language understanding in large language models by injecting Hanja-based semantic disambiguation during continual pre-training, yielding a 21% relative improvement on the KoBALT benchmark and cross-lingual benefits without extra inference cost.
Contribution
The paper introduces HanjaBridge, a novel Hanja-augmented pre-training method that improves Korean LLMs' semantic disambiguation and cross-lingual transfer capabilities.
Findings
21% relative improvement on the KoBALT benchmark
Enhanced semantic alignment between Korean and Chinese
Maintains gains without additional inference cost
Abstract
Large language models (LLMs) often show poor performance in low-resource languages like Korean, partly due to unique linguistic challenges such as homophonous Sino-Korean words that are indistinguishable in Hangul script. To address this semantic ambiguity, we propose HanjaBridge, a novel meaning-injection technique integrated into a continual pre-training (CPT) framework. Instead of deterministically mapping a word to a single Hanja (Chinese character), HanjaBridge presents the model with all possible Hanja candidates for a given homograph, encouraging the model to learn contextual disambiguation. This process is paired with token-level knowledge distillation to prevent catastrophic forgetting. Experimental results show that HanjaBridge significantly improves Korean language understanding, achieving a 21% relative improvement on the KoBALT benchmark. Notably, by reinforcing semantic…
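The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the `word(cand1/cand2)` annotation format, and the function names `augment_with_hanja` and `token_level_kd_loss` are all assumptions made for illustration. The first function attaches every candidate Hanja form to an ambiguous Hangul word rather than choosing one; the second computes a mean per-token KL divergence between a teacher and a student distribution, which is one common way to express token-level knowledge distillation.

```python
import math

# Hypothetical toy lexicon (illustrative only, not from the paper):
# Hangul word -> all candidate Hanja spellings.
HANJA_CANDIDATES = {
    "사기": ["詐欺", "士氣", "史記"],  # fraud, morale, historical records
    "의사": ["醫師", "意思"],          # doctor, intention
}


def augment_with_hanja(tokens, lexicon=HANJA_CANDIDATES):
    """Annotate each ambiguous word with all of its Hanja candidates.

    The annotation format is an assumption; the source only states that
    every candidate is shown to the model instead of a single mapping.
    """
    out = []
    for tok in tokens:
        candidates = lexicon.get(tok)
        if candidates:
            out.append(f"{tok}({'/'.join(candidates)})")
        else:
            out.append(tok)  # unambiguous words pass through unchanged
    return out


def token_level_kd_loss(teacher_probs, student_probs, eps=1e-12):
    """Mean per-token KL(teacher || student) over a sequence.

    Each argument is a list of probability distributions, one per token
    position, defined over the same vocabulary.
    """
    total = 0.0
    for t_dist, s_dist in zip(teacher_probs, student_probs):
        total += sum(
            t * math.log((t + eps) / (s + eps))
            for t, s in zip(t_dist, s_dist)
            if t > 0
        )
    return total / len(teacher_probs)


if __name__ == "__main__":
    print(augment_with_hanja(["그", "의사", "는", "사기", "를"]))
    teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
    print(token_level_kd_loss(teacher, student))
```

In this sketch the annotated text would be used only during continual pre-training, which is consistent with the claim that no extra cost is incurred at inference; how the paper actually formats candidates and weights the distillation term is not specified in the text above.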
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Knowledge Distillation
