ReNCE: Learning to Reason by Noise Contrastive Estimation
Wenzheng Zhang, Karl Stratos

TL;DR
This paper introduces ReNCE, a contrastive learning method for enhancing reasoning in pretrained language models, offering a more straightforward alternative to advantage estimation techniques like GRPO.
Contribution
ReNCE presents a novel explicit contrastive learning framework for LLM reasoning, simplifying the training process compared to advantage-based methods.
Findings
ReNCE achieves competitive results on math benchmarks.
It outperforms some existing methods like DAPO and online DPO.
The approach simplifies training by avoiding advantage estimation.
Abstract
GRPO is a standard approach to endowing pretrained LLMs with reasoning capabilities. It estimates the advantage of an outcome from a group of outcomes, and promotes those with positive advantages inside a trust region. Since GRPO discriminates between good and bad outcomes softly, it benefits from additional refinements such as asymmetric clipping and zero-variance data filtering. While effective, these refinements require significant empirical insight and can be challenging to identify. We instead propose an explicit contrastive learning approach. Instead of estimating advantages, we bifurcate outcomes into positive and negative sets, then maximize the likelihood of positive outcomes. Our approach can be viewed as an online instantiation of (multi-label) noise contrastive estimation for LLM reasoning. We validate our method by demonstrating competitive performance on a suite of…
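The contrast the abstract draws, between soft group-relative advantages and a hard split into positive and negative outcomes, can be illustrated with a small sketch. The summary above does not give the exact ReNCE objective, so everything below is an assumption made for illustration: the function names, the reward threshold, the use of one scalar score per outcome (for example a sequence log-likelihood), and the particular multi-label contrastive form in which each positive competes against all negatives.

```python
import math


def group_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        # Zero-variance group: every outcome got the same reward,
        # so the group carries no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def split_outcomes(rewards, threshold=0.5):
    """Hard split into positive and negative index sets (threshold is hypothetical)."""
    pos = [i for i, r in enumerate(rewards) if r > threshold]
    neg = [i for i, r in enumerate(rewards) if r <= threshold]
    return pos, neg


def multilabel_nce_loss(scores, pos, neg):
    """Illustrative multi-label contrastive loss, not the paper's exact objective.

    For each positive i the loss is
        -log( exp(s_i) / (exp(s_i) + sum_{j in neg} exp(s_j)) ),
    averaged over positives.
    """
    if not pos or not neg:
        return 0.0  # nothing to contrast in this group
    total = 0.0
    for i in pos:
        candidates = [scores[i]] + [scores[j] for j in neg]
        m = max(candidates)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(c - m) for c in candidates))
        total += log_z - scores[i]
    return total / len(pos)


rewards = [1.0, 0.0, 0.0, 1.0]        # e.g. correct / incorrect answers
scores = [0.0, 0.0, 0.0, 0.0]         # hypothetical per-outcome model scores
pos, neg = split_outcomes(rewards)
print(group_advantages(rewards))
print(multilabel_nce_loss(scores, pos, neg))
```

In this sketch, raising the scores of positive outcomes relative to the negatives lowers the loss, which is the sense in which a contrastive objective promotes good outcomes without computing an advantage for each one.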
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Advanced Graph Neural Networks
