Hard Negative Sample-Augmented DPO Post-Training for Small Language Models
Haocheng Lu, Minjun Zhu, Henry Yu

TL;DR
This paper introduces a lightweight post-training method for small language models. A MathVerifier identifies structurally flawed solutions, which are then emphasized as hard negatives during DPO, improving mathematical reasoning without large reward models.
Contribution
The paper proposes a verifier-guided, weighted DPO approach that targets structured errors in mathematical solutions and improves model performance without training a large reward model.
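To make the idea of a weighted DPO objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not state how the verifier's output is turned into weights or how the weights enter the loss. The sketch assumes each preference pair carries one non-negative scalar weight that multiplies the standard DPO loss term, and that the batch loss is normalized by the sum of the weights. The function and parameter names (`weighted_dpo_loss`, `weights`, `beta`) are hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    weights: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Weighted mean of per-pair DPO losses.

    Inputs are summed sequence log-probabilities under the policy and the
    frozen reference model. `weights` is an assumed per-pair scalar, e.g.
    larger for rejected solutions a verifier flags as hard negatives.
    """
    n = len(weights)
    seqs = (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
    if not all(len(s) == n for s in seqs):
        raise ValueError("all inputs must have the same length")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    total = 0.0
    for pc, pr, rc, rr, w in zip(*seqs, weights):
        # Standard DPO margin: how much more the policy prefers the chosen
        # response over the rejected one, relative to the reference model.
        margin = beta * ((pc - rc) - (pr - rr))
        total += w * -log_sigmoid(margin)
    # Assumption: normalize by the weight sum so the loss scale stays stable.
    return total / weight_sum


# Pair 0 is not yet separated by the policy (margin 0); pair 1 already is.
args = ([-2.0, -2.0], [-2.0, -6.0], [-2.0, -2.0], [-2.0, -2.0])
uniform = weighted_dpo_loss(*args, weights=[1.0, 1.0], beta=1.0)
upweighted = weighted_dpo_loss(*args, weights=[3.0, 1.0], beta=1.0)
print(uniform, upweighted)
```

With uniform weights this reduces to the ordinary mean DPO loss. Raising the weight on the unseparated pair increases its share of the batch loss, which is the sense in which a weighting scheme can focus training on harder negatives.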
Findings
Verifier-guided weighted DPO improves logical and numerical reasoning.
The method outperforms vanilla SFT and unweighted DPO on near-correct solutions.
The approach avoids expensive reward model training.
Abstract
Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a…
