Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs
Hadi Mohammadi, Tamas Kozak, and Anastasia Giachanou

TL;DR
This paper evaluates two optimization methods, GRPO and DPO, for improving the faithfulness of chain-of-thought reasoning in large language models, and finds that GRPO generally performs better, especially in larger models.
Contribution
The study systematically compares GRPO and DPO in improving the faithfulness of LLMs' reasoning, highlighting GRPO's superior performance and potential for trustworthy AI.
Findings
GRPO outperforms DPO in larger models
Model size positively correlates with faithfulness improvements
GRPO shows greater potential despite less stability at smaller scales
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than…
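For readers unfamiliar with the two methods named in the abstract, the following is a minimal sketch of the core quantity each one optimizes, in their commonly published forms. It is not the authors' implementation: the function names, the choice of population standard deviation, the `eps` constant, and the default `beta` are assumptions made for illustration, and the paper's actual reward design for faithfulness is not shown here.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each sampled response's reward
    against the other responses drawn for the same prompt.
    (Assumption: population std and a small eps to avoid division by zero.)"""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin)), where the
    margin compares policy-vs-reference log-ratios of the chosen and
    rejected responses. (Assumption: beta=0.1 is only an illustrative default.)"""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical rewards for four sampled reasoning chains on one prompt.
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
    # Hypothetical sequence log-probabilities for one preference pair.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In this sketch, GRPO needs only a scalar reward per sampled response and compares responses within a group, while DPO needs paired chosen/rejected responses plus log-probabilities from a frozen reference model.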
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
