A First-Order Logic-Based Alternative to Reward Models in RLHF
Chunjin Jian, Xinhua Zhu

TL;DR
This paper introduces a logic-similarity-based reward mechanism for RLHF that replaces traditional reward models, using formal logical consistency to improve alignment of language models with human preferences.
Contribution
It proposes S-GRPO, a supervised variant of GRPO, integrating logical consistency and joint optimization to enhance model alignment and robustness.
Findings
S-GRPO achieves higher performance and greater robustness than standard supervised fine-tuning.
The method extends preference-learning frameworks like GRPO and DPO.
The approach offers a flexible, task-adaptive alignment training method.
Abstract
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. However, the quality and stability of the trained reward model largely determine the final alignment performance. Existing approaches such as Proximal Policy Optimization (PPO) rely heavily on reward models to guide LLMs toward human-aligned behaviors. In this work, we propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling. Instead of relying on heuristic reward estimation, our method leverages formal logical consistency to steer model alignment with human preferences. Since real-world questions can be interpreted from multiple perspectives, to ensure that logic-based reinforcement learning does not cause model collapse, we introduce S-GRPO, a supervised variant of the GRPO framework. S-GRPO…
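As a rough illustration of the idea in the abstract, the sketch below scores a group of sampled responses by logical overlap with a reference and then applies the group-relative normalization used in GRPO-style training. The abstract here is truncated and does not specify how logic similarity is computed, so the Jaccard overlap over sets of first-order atoms is only a stand-in, and the names `logic_similarity` and `group_relative_advantages` are hypothetical. The supervised component of S-GRPO is not shown.

```python
from statistics import mean, pstdev


def logic_similarity(pred_atoms, ref_atoms):
    """Stand-in reward (assumption): Jaccard overlap between two sets of
    first-order atoms, written here as plain strings."""
    pred, ref = set(pred_atoms), set(ref_atoms)
    if not pred and not ref:
        return 1.0  # two empty formulas are treated as identical
    return len(pred & ref) / len(pred | ref)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: centre each reward on the group mean and
    scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reference formula and a group of three sampled responses,
# each already translated into a set of atoms.
reference = {"Human(socrates)", "Mortal(socrates)"}
group = [
    {"Human(socrates)", "Mortal(socrates)"},
    {"Human(socrates)"},
    {"Immortal(socrates)"},
]

rewards = [logic_similarity(g, reference) for g in group]
advantages = group_relative_advantages(rewards)
```

In this toy setup no learned reward model is involved: the response whose atoms match the reference most closely receives the largest advantage, and the one sharing no atoms receives the smallest.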
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques
