Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

Zeli Su; Ziyin Zhang; Zhou Liu; Xuexian Song; Zhankai Xu; Longfei Zheng; Xiaolu Zhang; Rong Fu; Guixian Xu; Wentao Zhang

arXiv:2605.14366 · cs.CL · May 15, 2026

TL;DR

This paper proposes a semantic-space alignment method that trains large language models with reinforcement learning on embedding-level semantic rewards, extending them to low-resource languages while reducing the alignment tax and preserving general capabilities.

Contribution

The authors replace token-level likelihood maximization with Group Relative Policy Optimization (GRPO) driven by embedding-level semantic rewards, aiming to improve low-resource language tasks while reducing destructive interference with pretrained knowledge.

Findings

01. Achieves low-resource language capabilities with less alignment tax.

02. Preserves general competence better than supervised fine-tuning.

03. Yields higher semantic quality and robustness in low-resource settings.

Abstract

Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that…
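The abstract does not spell out the reward function or the encoder, so the following is only a minimal sketch of how an embedding-level reward could feed GRPO's group-relative advantage. It assumes the reward is the cosine similarity between a sampled output and a reference, and it uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation). `toy_embed` is a hypothetical stand-in for a real multilingual sentence encoder, and all function names here are illustrative rather than taken from the paper.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Hypothetical stand-in for a sentence encoder (assumption, not the paper's).

    Hashes character bigrams into `dim` buckets and counts them.
    """
    vec = [0.0] * dim
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
    return vec


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_rewards(
    candidates: Sequence[str],
    reference: str,
    embed: Callable[[str], Vector] = toy_embed,
) -> List[float]:
    """Assumed reward: similarity of each sampled output to the reference."""
    ref_vec = embed(reference)
    return [cosine(embed(c), ref_vec) for c in candidates]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, a group of sampled outputs, one reference.
reference = "the river rises after heavy rain"
group = [
    "the river rises after heavy rain",
    "after heavy rain the river rises",
    "stock prices fell sharply today",
]
rewards = semantic_rewards(group, reference)
advantages = group_relative_advantages(rewards)
```

Because the reward is computed in embedding space, a paraphrase is scored by how close its meaning representation is to the reference rather than by token overlap with one fixed target, which is the flexibility the abstract contrasts with SFT. In a real setup the advantages would weight the policy-gradient update for each sampled output.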
