TL;DR
PiCA introduces a pivot-based reward mechanism for search agents in reinforcement learning, effectively addressing long-horizon credit assignment challenges and improving performance on knowledge-intensive tasks.
Contribution
The paper proposes PiCA, a novel pivot-based credit assignment method that enhances reward signals by leveraging success probabilities and historical context, outperforming existing methods.
Findings
PiCA achieves improvements of 15.2% and 2.2% on 3B and 7B models, respectively.
PiCA outperforms strong baselines across seven QA benchmarks.
PiCA maintains distributional consistency while providing dense, pivot-aware guidance.
Abstract
Large Language Model (LLM)-based search agents trained with reinforcement learning (RL) have significantly improved performance on knowledge-intensive tasks. However, existing methods encounter critical challenges in long-horizon credit assignment: (i) Reward Sparsity, where models receive only outcome feedback without step-level guidance to differentiate action quality; (ii) Isolated Credit, where credit is assigned to steps independently, failing to capture sequential dependencies; and (iii) Distributional Shift, where rewards are estimated on templates that deviate from the model's natural generative distribution. To address these issues, we propose Pivot-Based Credit Assignment (PiCA), a novel step reward mechanism that reformulates the search trajectory as a sequential process of cumulative search progress. Unlike prior isolated step rewards, PiCA defines process rewards as…
