ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

Changjiang Gao; Zixian Huang; Kaichen Yang; Jiajun Chen; Jixing Li; Shujian Huang

arXiv:2602.21887·cs.CL·February 26, 2026

Open Access

TL;DR

ExpLang introduces a multilingual post-training approach for large reasoning models, enhancing exploration and exploitation during reinforcement learning by dynamically selecting thinking languages, leading to improved performance over English-only training.

Contribution

The paper presents a novel on-policy language selection method during RL training that leverages multilingual thinking to improve reasoning performance in large language models.
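To make the idea of treating the thinking language as an RL action concrete, the following is a minimal toy sketch, not the paper's algorithm (this page does not describe the actual training procedure). It assumes a categorical policy over a hypothetical language set, a made-up per-language success rate standing in for task reward, and a plain REINFORCE update with a running-mean baseline. All names (`LANGUAGES`, `toy_reward`, `train`) are illustrative assumptions.

```python
import math
import random

# Hypothetical language set; the paper's actual languages are not listed on this page.
LANGUAGES = ["en", "zh", "fr"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_language(logits, rng):
    """Sample a language index on-policy from the current distribution."""
    probs = softmax(logits)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs


def reinforce_update(logits, idx, probs, reward, baseline, lr):
    """One REINFORCE step: d log pi(idx) / d logits = onehot(idx) - probs."""
    advantage = reward - baseline
    return [
        logit + lr * advantage * ((1.0 if i == idx else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def toy_reward(lang, rng):
    """Stand-in for a verifiable task reward; success rates are invented."""
    success_rate = {"en": 0.6, "zh": 0.7, "fr": 0.4}
    return 1.0 if rng.random() < success_rate[lang] else 0.0


def train(steps=2000, lr=0.1, seed=0):
    """Learn a language preference from sampled rewards only."""
    rng = random.Random(seed)
    logits = [0.0] * len(LANGUAGES)
    baseline = 0.0
    for _ in range(steps):
        idx, probs = sample_language(logits, rng)
        reward = toy_reward(LANGUAGES[idx], rng)
        logits = reinforce_update(logits, idx, probs, reward, baseline, lr)
        baseline += 0.05 * (reward - baseline)  # running-mean baseline
    return dict(zip(LANGUAGES, softmax(logits)))


if __name__ == "__main__":
    print(train())
```

In this sketch the language choice is sampled from the policy being trained rather than fixed in advance, which is the sense of "on-policy" assumed here. In a real pipeline the reward would come from the correctness of the full reasoning trace, and the language choice would be part of the model's own generation rather than a separate logit vector.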

Findings

01 Outperforms English-only training with the same budget

02 Achieves high thinking language compliance for seen and unseen languages

03 Extends RL exploration space with diversified language preferences

Abstract

Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in expectation of the strongest performance, despite the demonstrated potential advantage of multilingual thinking, as well as the requirement for native thinking traces by global users. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection to improve exploration and exploitation during RL with the use of multiple languages. The results show that our method steadily outperforms English-only training with the same training budget, while showing high thinking language compliance for both seen and unseen languages. Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively…

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling