TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs
Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang

TL;DR
TIPS introduces a turn-level reward shaping method for search-augmented LLMs. It improves training stability and performance on QA benchmarks by giving each turn a dense reward based on how much that turn increases the correct answer's likelihood under a teacher model.
Contribution
The paper proposes TIPS, a novel turn-level reward shaping framework that enhances reinforcement learning for LLMs by addressing sparse rewards and credit assignment issues.
Findings
TIPS consistently outperforms GRPO/PPO baselines on seven QA benchmarks.
TIPS improves Exact Match scores by 11.8% and F1 by 13.6% on average.
Training stability is substantially enhanced with TIPS.
Abstract
Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignment across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For…
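The turn-level reward described in the abstract can be illustrated with the standard potential-based shaping form, F_t = gamma * Phi(s_{t+1}) - Phi(s_t), where Phi is taken to be the teacher's log-likelihood of the correct answer given the context so far. The sketch below is a minimal illustration under that assumption, not the paper's implementation: the function name `shaped_turn_rewards`, the discount `gamma`, and the log-likelihood values are all hypothetical, and a real setup would obtain the potentials by scoring the gold answer with a teacher model after each turn.

```python
from typing import List, Sequence


def shaped_turn_rewards(potentials: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Compute one shaping reward per turn from a sequence of potentials.

    potentials[0] is the potential before any turn; potentials[t] is the
    potential after turn t. Here a potential stands in for a teacher model's
    log-likelihood of the correct answer (an assumption for illustration).
    """
    return [
        gamma * potentials[t + 1] - potentials[t]
        for t in range(len(potentials) - 1)
    ]


# Hypothetical teacher log-likelihoods of the gold answer:
# before any search, then after each of three reasoning + tool-call turns.
potentials = [-6.0, -4.5, -4.75, -1.0]

rewards = shaped_turn_rewards(potentials)

# With gamma = 1 the per-turn rewards telescope: their sum equals the total
# change in potential, so shaping redistributes credit across turns without
# changing which trajectories are preferred overall.
total_shaping = sum(rewards)
potential_change = potentials[-1] - potentials[0]

print(rewards)
print(total_shaping, potential_change)
```

In this toy trajectory the second turn receives a small negative reward because the hypothetical teacher became slightly less confident after it, while the third turn receives most of the credit. The telescoping sum is the property that underlies the policy-invariance argument for potential-based shaping.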
Peer Reviews
Decision: ICLR 2026 Poster
1. The motivation of this paper is sound: it aims to introduce denser reward signals to improve RL training.
2. The experimental results show notable improvements.
1. The writing quality needs improvement: Section 3 is hard to follow, and I could not find any mention of which exact model is used as the teacher model (unless I missed something).
2. The experimental setup seems outdated. Why not evaluate on GAIA or BrowseComp for search LLMs? Likewise, why stick with the Qwen2.5 series, which is barely capable of search, instead of building upon the latest Qwen3 or other up-to-date models?
3. The proposed method is heavily tied to a teacher model. A fairer basel
- Built on potential-based reward shaping, TIPS ensures that policy invariance is maintained while providing denser feedback signals, addressing a fundamental limitation in sparse-reward reinforcement learning for language models.
- The paper presents a well-structured pipeline (multi-turn reasoning, retrieval, teacher evaluation, and potential-based shaping), supported by consistent mathematical logic and implementation clarity.
- Evaluations across seven QA benchmarks and two model scales demonst
- The reward signal is fully determined by the teacher model’s likelihood estimates. If the teacher is miscalibrated or biased, the shaping signal may misrepresent information gain. No calibration analysis or correction mechanism is discussed.
- All experiments use the same teacher model (Qwen-2.5), differing only in whether it is fixed or periodically refreshed. The paper does not evaluate how the reward behaves with different teachers, leaving the robustness of TIPS to teacher variation untest
Originality: The core idea of using a teacher model's likelihood of the correct answer to compute information-gain rewards is highly original. It provides a principled and automated way to generate dense supervision, distinct from heuristic rules or learned reward models.
Quality: The work is of very high quality. The combination of a solid theoretical grounding (PBRS) with extensive and carefully designed empirical validation is commendable. The ablations and analysis sections are particularly
The weaknesses are minor and do not detract from the overall excellent contribution.
1. Computational Overhead: While not explicitly quantified, using a teacher model (especially a 7B model) to compute log-likelihoods for every turn during training introduces non-trivial computational overhead compared to outcome-only rewards. A brief discussion of this cost (e.g., estimated % increase in training time or FLOPs) would be helpful for practitioners.
2. Teacher-Student Capacity: The method assume
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
