Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

Haocheng Lu; Minjun Zhu; Henry Yu

arXiv:2512.19728 · cs.LG · April 15, 2026

TL;DR

This paper introduces a lightweight post-training method for small language models that uses a MathVerifier to identify and emphasize structurally flawed solutions, improving mathematical reasoning without large reward models.

Contribution

The paper proposes a verifier-guided weighted DPO approach that targets structured errors in mathematical solutions, improving model performance without training a large reward model.
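As a rough illustration of what a weighted DPO objective can look like, the sketch below scales the standard per-pair DPO loss by a scalar weight. It is a minimal sketch, not the authors' implementation: the function name `weighted_dpo_loss`, the `weight` argument, and the default `beta=0.1` are assumptions made for illustration, and how the verifier's output is turned into a weight is not specified in the text above.

```python
import math


def weighted_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                      ref_chosen_logp, ref_rejected_logp,
                      weight=1.0, beta=0.1):
    """Standard DPO loss for one preference pair, scaled by `weight`.

    `weight` is a hypothetical per-pair scalar; in a verifier-guided setup
    it would be larger for rejected solutions the verifier flags as hard
    negatives. The mapping from verifier output to weight is assumed.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # solution over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return weight * loss


# Policy identical to the reference gives a zero margin, so the loss is log(2).
print(weighted_dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Doubling the weight doubles that pair's contribution to the loss.
print(weighted_dpo_loss(-1.0, -1.0, -1.0, -1.0, weight=2.0))
```

With `weight=1.0` for every pair this reduces to unweighted DPO, so the weighting only changes how strongly each rejected solution is pushed away.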

Findings

01. Verifier-guided weighted DPO improves logical and numerical reasoning.

02. The method outperforms vanilla SFT and unweighted DPO on near-correct solutions.

03. The approach avoids expensive reward model training.

Abstract

Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a…
