Information-Consistent Language Model Recommendations through Group Relative Policy Optimization
Sonal Prabhune, Balaji Padmanabhan, and Kaushik Dutta

TL;DR
This paper introduces a reinforcement learning method based on Group Relative Policy Optimization (GRPO) that makes large language models deliver the same information across semantically equivalent prompts, a property that matters for enterprise applications.
Contribution
The paper adapts GRPO to enforce information stability in LLMs by introducing entropy-based rewards and prompt grouping, a novel approach to improving consistency in enterprise contexts.
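To make the idea of an entropy-based reward combined with group-relative advantages more concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary above does not specify the reward, so the per-response reward used here (the log of how often an answer occurs within its group) is an assumption, chosen because its group mean equals the negative Shannon entropy of the group. The function names, the example answers, and the use of the population standard deviation are likewise hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def shannon_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution over answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def consistency_rewards(answers):
    """Assumed reward: log2 of each answer's frequency within its group.

    Answers that agree with the majority score close to 0; rare answers
    score strongly negative. The mean reward equals -shannon_entropy(answers).
    """
    counts = Counter(answers)
    total = len(answers)
    return [math.log2(counts[a] / total) for a in answers]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: answers extracted from responses to four paraphrases
# of the same question.
answers = ["plan A", "plan A", "plan A", "plan B"]
rewards = consistency_rewards(answers)
advantages = group_relative_advantages(rewards)
```

In this sketch the three agreeing answers receive a positive advantage and the outlier a negative one, so a policy update would push the model toward the majority answer for that prompt group. When every answer in a group is identical, the entropy is zero and all advantages are zero.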
Findings
The GRPO-fine-tuned model shows reduced variability compared to the baseline.
The approach effectively enforces information stability across prompt groups.
This is the first application of GRPO to aligning LLMs toward information consistency.
Abstract
Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios, such as HR onboarding, customer support, or policy disclosure, require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity, but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement…
