TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

Yutao Xie; Nathaniel Thomas; Nicklas Hansen; Yang Fu; Li Erran Li; Xiaolong Wang

arXiv:2603.22293·cs.CL·March 25, 2026

Open Access · 3 Reviews

TL;DR

TIPS introduces a turn-level reward shaping method for search-augmented LLMs, significantly improving training stability and performance on QA benchmarks by providing dense, fine-grained rewards based on increased answer likelihood.

Contribution

The paper proposes TIPS, a novel turn-level reward shaping framework that enhances reinforcement learning for LLMs by addressing sparse rewards and credit assignment issues.

Findings

01 TIPS outperforms PPO baselines on seven QA benchmarks.

02 TIPS improves Exact Match scores by 11.8% and F1 by 13.6% on average.

03 Training stability is substantially enhanced with TIPS.

Abstract

Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains a significant challenge. Optimization is often unstable due to sparse rewards and difficult credit assignment across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For…
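The turn-level shaping idea in the abstract can be illustrated with a minimal sketch. It assumes the potential Φ is the teacher model's log-likelihood of the correct answer given the context so far, and it uses the standard potential-based shaping form γΦ(s') − Φ(s); the exact potential, discount, and scaling used by TIPS are not given in this excerpt. The function name `shaped_turn_rewards` and the numeric log-likelihoods are hypothetical, chosen only for illustration.

```python
from typing import List


def shaped_turn_rewards(potentials: List[float], gamma: float = 1.0) -> List[float]:
    """Return one shaping reward per turn.

    potentials[0] is the (assumed) teacher log-likelihood of the correct
    answer before any search turn; potentials[t] is the value after turn t.
    Each turn t receives gamma * potentials[t + 1] - potentials[t].
    """
    return [
        gamma * potentials[t + 1] - potentials[t]
        for t in range(len(potentials) - 1)
    ]


# Hypothetical teacher log-likelihoods over a three-turn rollout.
potentials = [-4.0, -2.5, -2.75, -0.5]
rewards = shaped_turn_rewards(potentials)

# With gamma = 1 the per-turn rewards telescope: their sum depends only on
# the first and last potentials, not on the path taken between them.
total = sum(rewards)
print(rewards, total)
```

In this toy rollout the second turn lowers the answer likelihood and receives a negative reward, while the other two turns are rewarded. Because the per-turn terms telescope when γ = 1, the shaping adds dense feedback without changing which complete trajectories score best, which is the intuition behind the policy-invariance claim.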

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The motivation of this paper is sound: it aims to introduce denser reward signals to improve RL training.
2. The experimental results show notable improvements.

Weaknesses

1. The writing quality needs improvement: Section 3 is hard to follow, and I could not find any mention of which model serves as the teacher (unless I missed it).
2. The experimental setup seems outdated. Why not evaluate on GAIA or BrowseComp for search LLMs? Likewise, why stick with the Qwen2.5 series, which is barely capable of search, instead of building on the latest Qwen3 or other up-to-date models?
3. The proposed method is heavily tied to a teacher model. A fairer basel

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Built on potential-based reward shaping, TIPS ensures that policy invariance is maintained while providing denser feedback signals, addressing a fundamental limitation in sparse-reward reinforcement learning for language models.
- The paper presents a well-structured pipeline (multi-turn reasoning, retrieval, teacher evaluation, and potential-based shaping) supported by consistent mathematical logic and implementation clarity.
- Evaluations across seven QA benchmarks and two model scales demonst

Weaknesses

- The reward signal is fully determined by the teacher model’s likelihood estimates. If the teacher is miscalibrated or biased, the shaping signal may misrepresent information gain. No calibration analysis or correction mechanism is discussed.
- All experiments use the same teacher model (Qwen-2.5), differing only in whether it is fixed or periodically refreshed. The paper does not evaluate how the reward behaves with different teachers, leaving the robustness of TIPS to teacher variation untest

Reviewer 03 · Rating 4 · Confidence 2

Strengths

Originality: The core idea of using a teacher model's likelihood of the correct answer to compute information-gain rewards is highly original. It provides a principled and automated way to generate dense supervision, distinct from heuristic rules or learned reward models.

Quality: The work is of very high quality. The combination of a solid theoretical grounding (PBRS) with extensive and carefully designed empirical validation is commendable. The ablations and analysis sections are particularly

Weaknesses

The weaknesses are minor and do not detract from the overall excellent contribution.

1. Computational Overhead: While not explicitly quantified, using a teacher model (especially a 7B model) to compute log-likelihoods for every turn during training introduces non-trivial computational overhead compared to outcome-only rewards. A brief discussion of this cost (e.g., estimated % increase in training time or FLOPs) would be helpful for practitioners.
2. Teacher-Student Capacity: The method assume

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications