Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

Hadi Mohammadi, Tamas Kozak, and Anastasia Giachanou

arXiv:2512.22631·cs.CL·December 30, 2025

TL;DR

This paper evaluates two optimization methods, GRPO and DPO, for improving the faithfulness of chain-of-thought reasoning in large language models, and finds that GRPO generally performs better, especially in larger models.

Contribution

The study systematically compares GRPO and DPO in improving the faithfulness of LLMs' reasoning, highlighting GRPO's superior performance and potential for trustworthy AI.

Findings

01. GRPO outperforms DPO in larger models
02. Model size positively correlates with faithfulness improvements
03. GRPO shows greater potential despite less stability at smaller scales

Abstract

Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than…
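
For readers unfamiliar with the two methods named in the abstract, the following is a minimal sketch of the core quantity each one optimizes, not the authors' implementation. It assumes the commonly published formulations: GRPO standardizes rewards within a group of sampled responses, and DPO applies a logistic loss to the reference-adjusted log-probability margin between a preferred and a rejected response. The function names, the `beta` and `eps` defaults, the use of a population standard deviation, and the example reward values are all hypothetical; the listing above does not state the paper's reward design or hyperparameters.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each reward within its group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    Using the population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are identical
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy and under a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x)), with x = beta * margin
    return math.log1p(math.exp(-beta * margin))


# Hypothetical rewards for four sampled chains of thought (1.0 = judged faithful)
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical log-probabilities where the policy already prefers the chosen response
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In this sketch, responses rewarded above their group mean receive positive advantages and the rest negative ones, so GRPO needs only a scalar reward per sample, whereas DPO needs explicit chosen and rejected pairs. When the policy matches the reference model, the DPO margin is zero and the loss equals log 2.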

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications