HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training

Seungho Choi

arXiv:2507.10920 · cs.CL · July 16, 2025

TL;DR

HanjaBridge improves Korean language understanding in large language models by exposing them to the candidate Hanja for ambiguous Sino-Korean words during continual pre-training, yielding a 21% relative improvement on the KoBALT benchmark and closer Korean-Chinese semantic alignment at no extra inference cost.

Contribution

The paper introduces HanjaBridge, a novel Hanja-augmented pre-training method that improves Korean LLMs' semantic disambiguation and cross-lingual transfer capabilities.

Findings

01. 21% relative improvement on the KoBALT benchmark

02. Enhanced semantic alignment between Korean and Chinese

03. Maintains gains without additional inference cost

Abstract

Large language models (LLMs) often show poor performance in low-resource languages like Korean, partly due to unique linguistic challenges such as homophonous Sino-Korean words that are indistinguishable in Hangul script. To address this semantic ambiguity, we propose HanjaBridge, a novel meaning-injection technique integrated into a continual pre-training (CPT) framework. Instead of deterministically mapping a word to a single Hanja (Chinese character), HanjaBridge presents the model with all possible Hanja candidates for a given homograph, encouraging the model to learn contextual disambiguation. This process is paired with token-level knowledge distillation to prevent catastrophic forgetting. Experimental results show that HanjaBridge significantly improves Korean language understanding, achieving a 21% relative improvement on the KoBALT benchmark. Notably, by reinforcing semantic…
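
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration rather than the paper's implementation: the toy lexicon `HANJA_CANDIDATES`, the `word(cand1|cand2)` annotation format, the naive prefix matching (a real pipeline would likely use a morphological analyzer), and the choice of KL(teacher || student) as the per-token distillation loss.

```python
import math

# Hypothetical toy lexicon: Hangul homograph -> all candidate Hanja.
# A real system would draw on a full Sino-Korean dictionary.
HANJA_CANDIDATES = {
    "의사": ["醫師", "意思", "義士"],  # doctor / intention / righteous person
    "사과": ["沙果", "謝過"],          # apple / apology
}


def augment_with_hanja(text, lexicon=HANJA_CANDIDATES):
    """Attach every Hanja candidate to each matching word (assumed format).

    Matching is a naive prefix test on whitespace tokens, so trailing
    particles such as '가' or '를' are kept after the annotation.
    """
    out = []
    for token in text.split():
        for word, candidates in lexicon.items():
            if token.startswith(word):
                suffix = token[len(word):]
                token = f"{word}({'|'.join(candidates)}){suffix}"
                break
        out.append(token)
    return " ".join(out)


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_level_kd_loss(teacher_logits, student_logits):
    """Mean over token positions of KL(teacher || student).

    Each argument is a list of per-position logit lists. The teacher is
    assumed to be the frozen original model; the KL direction is also an
    assumption of this sketch.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)


if __name__ == "__main__":
    print(augment_with_hanja("의사가 사과를 먹었다"))
    print(token_level_kd_loss([[2.0, 0.5, 0.1]], [[1.0, 1.0, 0.2]]))
```

In this sketch the model sees all candidates during pre-training and must pick the right reading from context; since the annotation is applied only to training text, plain Hangul input at inference time needs no extra processing, which is consistent with the "no additional inference cost" finding. The distillation term is zero when the student matches the teacher exactly and grows as their per-token distributions diverge.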

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Knowledge Distillation