CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning

Huimu Yu; Xing Wu; Haotian Xu; Debing Zhang; Songlin Hu

arXiv:2410.02229·cs.AI·May 30, 2025

Open Access · 3 Reviews

TL;DR

CodePMP introduces a scalable pretraining pipeline for preference models using synthesized code-preference pairs, significantly enhancing reasoning capabilities of large language models across multiple benchmarks.

Contribution

The paper presents a novel scalable preference model pretraining method that leverages synthesized code-preference data to improve reward model finetuning for reasoning tasks.

Findings

01. Improved reasoning performance on GSM8K and MATH benchmarks.

02. Enhanced logical reasoning on ReClor and LogiQA2.0 datasets.

03. Scalable preference pretraining boosts reward modeling efficiency.

Abstract

Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0),…
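As a rough illustration of what training on (chosen, rejected) pairs optimizes, the sketch below computes a standard pairwise ranking (Bradley-Terry) loss over scalar reward scores. This is an assumption for illustration only: the abstract as shown here does not state which objective CodePMP uses, and the function name `pairwise_preference_loss` and the toy scores are hypothetical.

```python
import math


def pairwise_preference_loss(chosen_scores, rejected_scores):
    """Mean Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumed objective for illustration; not confirmed as the CodePMP loss.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("each chosen score needs a matching rejected score")
    if not chosen_scores:
        raise ValueError("at least one preference pair is required")

    total = 0.0
    for chosen, rejected in zip(chosen_scores, rejected_scores):
        margin = chosen - rejected
        # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_scores)


# Hypothetical reward scores for three code-preference pairs.
loss = pairwise_preference_loss([1.2, 0.4, 2.0], [0.3, 0.9, -1.0])
print(f"mean pairwise loss: {loss:.4f}")
```

The loss shrinks as the model scores the chosen response further above the rejected one, and equals log 2 when both receive the same score.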

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

- The proposed method is highly scalable, allowing it to create 28 million preferred and rejected response pairs.
- The reward model can improve two different reasoning tasks (math reasoning and logic reasoning).

Weaknesses

- CodePMP creates preference pairs for coding tasks, but coding tasks are not evaluated in experiments.
- The CodePMP data is constructed by deepseek-coder-instruct. It would be interesting to see whether CodePMP can further improve deepseek-coder-instruct on coding tasks.
- The reward model is initialized using Qwen2 models (Qwen2-1.5B and Qwen2-7B), which are more capable in math reasoning than the math generator MetaMath-Mistral-7B in Section 4.1.3. A more meaningful setting would be to ex

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. This work proposes an interesting idea of preference pre-training.
2. This method is automated, reducing the dependency on manually annotated preference data.
3. With best-of-N strategy, CodePMP could improve LLM's reasoning performance.
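The best-of-N strategy mentioned in the strengths above can be sketched as follows: a generator proposes N candidate answers and a reward model picks the highest-scoring one. This is a minimal sketch, not the paper's evaluation code; `best_of_n` and `toy_reward` are hypothetical names, and the toy scorer only stands in for a trained reward model.

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest reward score.

    Ties resolve to the earliest candidate, which is how max() behaves.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=score_fn)


def toy_reward(answer):
    """Hypothetical stand-in for a reward model: favors answers stating a final result."""
    return 1.0 if "answer is" in answer else 0.0


samples = [
    "I think it is probably 12.",
    "3 * 4 = 12, so the answer is 12.",
    "Not sure.",
]
print(best_of_n(samples, toy_reward))
```

In this setup the generator stays fixed, so any accuracy gain comes from how well the reward model ranks the sampled candidates.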

Weaknesses

1. Unfortunately, the effectiveness of the RM was not fully verified, e.g. by using the RM for RFT or PPO training.
2. It remains unclear whether and how code training could help reasoning tasks in natural language. It would be great if the authors could have explored more on relationship between coding and reasoning tasks for model training.
3. No mention of whether the dataset will be open sourced.
4. The models used in data construction are limited. It would be helpful to verify the generaliz

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. This work introduces a pipeline that synthesizes code-preference pairs. If it works well, it can help address the scarcity of high-quality preference data.
2. Large improvements are achieved on several reasoning tasks (GSM8K, MATH, ReClor, LogiQA2.0) with the synthesized code-preference pair data.
3. Some details of CodePMP are shown, which can be helpful to the community.

Weaknesses

1. The experimental details are missing or confusing and should be clarified in the next version. Please check the following Questions section for more details.
2. Only two Qwen2 models are used for evaluation in the experiments. Results from other model families could be used to verify the method.

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling