Lessons from Training Grounded LLMs with Verifiable Rewards

Shang Hong Sim; Tej Deep Pala; Vernon Toh; Hai Leong Chieu; Amir Zadeh; Chuan Li; Navonil Majumder; Soujanya Poria

arXiv:2506.15522 · cs.CL · June 19, 2025

TL;DR

This paper demonstrates that reinforcement learning with verifiable rewards and internal reasoning significantly improves the grounding, answer correctness, and citation quality of large language models, especially on unanswerable and complex queries.

Contribution

It introduces a two-stage training method using GRPO for outcome-based rewards and combines it with instruction tuning, advancing the reliability of grounded LLM responses.
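As a rough illustration of how outcome-based rewards feed into GRPO, the sketch below computes group-relative advantages: several responses are sampled for the same prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The `toy_reward` function and its equal weighting of answer correctness, citation quality, and refusal quality are hypothetical stand-ins, not the paper's reward definition.

```python
from statistics import mean, pstdev


def toy_reward(answer_correct: bool, citations_ok: bool, refusal_ok: bool) -> float:
    """Hypothetical verifiable reward: equal-weight average of three checks.

    The weighting is an assumption made for illustration only.
    """
    return (float(answer_correct) + float(citations_ok) + float(refusal_ok)) / 3.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by the toy reward.
rewards = [
    toy_reward(True, True, True),
    toy_reward(True, False, True),
    toy_reward(False, False, True),
    toy_reward(False, False, False),
]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and those below receive negative ones, so no separate value model or gold reasoning trace is needed to rank them.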

Findings

01. Models with reasoning and RL outperform instruction-only models.

02. Two-stage training stabilizes learning and improves grounding.

03. Combining GPT-4 distillation with GRPO enhances long-form QA performance.

Abstract

Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA, we show that reasoning-augmented models significantly outperform instruction-only variants,…

Taxonomy

Topics: Artificial Intelligence in Law