Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
Dong Shu, Denghui Zhang, Jessica Hullman

TL;DR
This paper introduces Influence-Guided PPO (I-PPO), a method that filters episodes based on influence scores to improve training efficiency and reasoning fidelity in PPO-based language models.
Contribution
The paper proposes a novel influence-based episode filtering technique for PPO, enhancing training speed and reasoning accuracy in language model fine-tuning.
Findings
I-PPO outperforms standard PPO and SFT baselines.
Filtering episodes accelerates training and improves reasoning fidelity.
Influence scores effectively identify unfaithful episodes.
Abstract
Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose Influence-Guided PPO (I-PPO), a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, accelerating training efficiency while effectively…
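The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes the influence score is a first-order dot product between each episode's gradient and a validation gradient, and that an episode is dropped when that score is negative (anti-aligned). The abstract does not specify the exact approximation, so the function names, the zero threshold, and the toy gradient vectors below are all hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def influence_scores(
    episode_grads: Sequence[Sequence[float]],
    val_grad: Sequence[float],
) -> List[float]:
    """Assumed first-order influence: <episode gradient, validation gradient>."""
    return [dot(g, val_grad) for g in episode_grads]


def filter_episodes(
    episodes: Sequence[str],
    episode_grads: Sequence[Sequence[float]],
    val_grad: Sequence[float],
    threshold: float = 0.0,  # hypothetical cutoff: negative score = anti-aligned
) -> Tuple[List[str], List[float]]:
    """Keep only episodes whose score is at or above the threshold."""
    scores = influence_scores(episode_grads, val_grad)
    kept = [ep for ep, s in zip(episodes, scores) if s >= threshold]
    return kept, scores


# Toy rollout buffer: three episodes with 2-dimensional stand-in gradients.
episodes = ["ep_a", "ep_b", "ep_c"]
episode_grads = [[1.0, 0.0], [-1.0, 0.5], [0.5, 0.5]]
val_grad = [1.0, 1.0]

kept, scores = filter_episodes(episodes, episode_grads, val_grad)
print(scores)  # [1.0, -0.5, 1.0]
print(kept)    # ['ep_a', 'ep_c']
```

In this toy buffer the second episode points against the validation gradient and is removed before the policy update. If every episode in a buffer were filtered out, no update would take place, which is one way to read the "intrinsic early stopping" behaviour the abstract mentions.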
