On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding

Haoyuan Wu; Rui Ming; Jilong Gao; Hangyu Zhao; Xueyi Chen; Yikai Yang; Haisheng Zheng; Zhuolun He; Bei Yu

arXiv:2505.12723·cs.CL·December 5, 2025

TL;DR

This paper introduces OORL, a reinforcement learning framework that integrates on-policy and off-policy strategies, together with Group Equivalent Preference Optimization (GEPO), a preference optimization method. Trained on code translation, large language models transfer coding proficiency across programming languages and improve at recognizing what code does.

Contribution

The paper proposes an RL framework that combines on-policy and off-policy strategies with a preference optimization method trained on groups of intermediate representations (IRs), aimed at improving multi-language code understanding in LLMs.

Findings

01. Significant performance improvements on multi-language code benchmarks.

02. Effective transfer of coding proficiency across diverse programming languages.

03. Enhanced recognition of code functionality and relationships between different language implementations.

Abstract

Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL, a novel reinforcement learning (RL) training framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. To complement this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM using groups of intermediate representations (IRs). LLMs can…
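
The rule-based reward derived from unit tests can be illustrated with a minimal sketch. The abstract does not specify the exact reward, so everything here is an assumption: the function name `unit_test_reward` is hypothetical, the reward is taken to be binary (1.0 if every test passes, 0.0 otherwise), and the translation target is Python so that the tests can be executed in-process. A real training pipeline would presumably compile and run the target language in a sandbox with a time limit.

```python
def unit_test_reward(candidate_code: str, test_code: str) -> float:
    """Hypothetical binary reward: 1.0 if the candidate passes all tests, else 0.0.

    Assumption: tests are plain `assert` statements that refer to names
    defined by the candidate. No sandboxing or timeout is applied here.
    """
    namespace: dict = {}
    try:
        # Define the candidate's functions in an isolated namespace.
        exec(candidate_code, namespace)
        # Run the unit tests against those definitions.
        exec(test_code, namespace)
    except Exception:
        # Any syntax error, runtime error, or failed assertion yields zero reward.
        return 0.0
    return 1.0


# Two hypothetical translations of the same source function.
correct_translation = "def add(a, b):\n    return a + b"
wrong_translation = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

print(unit_test_reward(correct_translation, tests))  # 1.0
print(unit_test_reward(wrong_translation, tests))    # 0.0
```

A pass/fail signal of this kind is coarse-grained in the sense the abstract describes: a translation that fails one test and one that fails every test receive the same reward.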
