TL;DR
This paper introduces a two-stage fine-tuning method for Qwen3 14B that targets Korean reasoning: supervised fine-tuning on a Korean reasoning dataset, followed by GRPO-based reinforcement learning in which an oracle judge model calibrates the reward signal to keep training stable. The authors report improved performance on Korean reasoning tasks.
Contribution
The paper presents a two-stage fine-tuning approach for improving Korean reasoning in large language models. Its second stage uses reinforcement learning with an oracle judge model, which addresses stability issues such as reward hacking and policy collapse.
Findings
Significant improvement in Korean reasoning benchmarks.
Enhanced problem-solving in math and coding tasks.
Stable reinforcement learning training with the oracle judge, avoiding the collapse observed in naive GRPO.
Abstract
We present a two-stage fine-tuning approach to make the large language model Qwen3 14B "think" natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training - such as reward hacking and policy collapse - by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance…
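To make the second stage more concrete, the sketch below shows the group-relative advantage at the core of standard GRPO: each sampled response's reward is normalized against the mean and standard deviation of its own group. The `calibrated_rewards` helper is purely hypothetical. The abstract does not say how the oracle judge calibrates the reward, so the thresholding rule, the function names, and the `threshold` parameter here are illustrative assumptions, not the authors' method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    This is the standard GRPO advantage: (r - group mean) / group std.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrated_rewards(raw_rewards, judge_scores, threshold=0.5):
    """HYPOTHETICAL calibration step (an assumption, not from the paper).

    Zeroes out the reward of any response whose judge score falls below
    `threshold`, as one simple way a judge could damp reward hacking.
    """
    return [r if s >= threshold else 0.0
            for r, s in zip(raw_rewards, judge_scores)]


if __name__ == "__main__":
    raw = [1.0, 1.0, 0.0, 0.0]       # rule-based rewards for 4 samples
    judge = [0.9, 0.1, 0.8, 0.7]     # assumed judge scores in [0, 1]
    rewards = calibrated_rewards(raw, judge)
    print(group_relative_advantages(rewards))
```

In this toy setup the second response earns a raw reward of 1.0 but a low judge score, so its reward is zeroed before normalization and it no longer receives a positive advantage.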
