Influence Functions for Preference Dataset Pruning
Daniel Fein, Gabriela Aranguiz-Dias

TL;DR
This paper explores the use of influence functions to identify and prune harmful training examples in preference datasets, improving fine-tuned language model performance.
Contribution
It adapts influence function techniques for dataset pruning in reward model training, demonstrating their effectiveness in enhancing model accuracy.
Findings
Influence function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples.
Gradient similarity outperforms influence functions in detecting helpful examples.
Local curvature appears to matter more for identifying harmful training examples than for identifying helpful ones.
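For orientation, the two scores being compared can be written in the standard first-order form shown below. This is a sketch under assumptions: the symbols (\hat\theta for the fitted parameters, \ell for the per-example loss, L_val for the mean validation loss, H for the training-loss Hessian) are introduced here for illustration, and the paper's exact sign convention and normalization are not stated on this page.

```latex
% Influence of up-weighting training example z_i on the validation loss
% (positive values mean up-weighting z_i raises validation loss, i.e. z_i looks harmful):
\[
\mathcal{I}(z_i) \;=\; -\,\nabla_\theta L_{\mathrm{val}}(\hat\theta)^{\top}\,
H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z_i,\hat\theta),
\qquad
H_{\hat\theta} \;=\; \frac{1}{n}\sum_{j=1}^{n}\nabla_\theta^{2}\,\ell(z_j,\hat\theta).
\]
% Gradient similarity drops the curvature term H^{-1}:
\[
\mathcal{S}(z_i) \;=\; \nabla_\theta L_{\mathrm{val}}(\hat\theta)^{\top}\,
\nabla_\theta \ell(z_i,\hat\theta).
\]
```

The only difference between the two scores is the inverse Hessian, which is the "local curvature" referred to in the findings. With H set to the identity, the influence score reduces to the negated gradient similarity.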
Abstract
Language models are commonly fine-tuned via reinforcement learning to alter their behavior or elicit new capabilities. Datasets used for these purposes, and particularly human preference datasets, are often noisy. The relatively small size of post-training datasets, combined with parameter-efficient fine-tuning methods, enables the use of influence function approximations to detect and prune training examples that are harmful to performance on a validation set. In this work, we adapt the TL;DR dataset for reward model training to demonstrate how conjugate-gradient approximated influence functions can be used to filter datasets. In our experiments, influence function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples. We also show that gradient similarity outperforms influence functions for detecting helpful training examples. This suggests…
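The filtering idea described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a linear Bradley-Terry reward model on synthetic difference features instead of a parameter-efficiently fine-tuned language model on TL;DR, and every name in it (`pair_grads`, `hvp`, `conjugate_gradient`, `score_examples`, `prune_most_harmful`, the `damping` value) is hypothetical. It solves H v = g_val with conjugate gradient using only Hessian-vector products, scores each training pair, and drops the 10% with the highest influence score.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def pair_grads(w, D):
    """Per-pair gradients of the Bradley-Terry loss -log sigmoid(w . d).
    Each row of D is d = features(chosen) - features(rejected)."""
    p = sigmoid(D @ w)
    return -(1.0 - p)[:, None] * D          # shape (n_pairs, dim)

def hvp(w, D, v, damping):
    """Hessian-vector product of the mean training loss, plus damping * v."""
    p = sigmoid(D @ w)
    return D.T @ (p * (1.0 - p) * (D @ v)) / len(D) + damping * v

def conjugate_gradient(matvec, b, iters=100, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, given only x -> A x."""
    x = np.zeros_like(b)
    r = b.copy()                            # residual b - A x (x starts at 0)
    p = r.copy()                            # search direction
    rs = r @ r
    for _ in range(iters):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def score_examples(w, D_train, D_val, damping=1e-2):
    g_train = pair_grads(w, D_train)
    g_val = pair_grads(w, D_val).mean(axis=0)
    v = conjugate_gradient(lambda u: hvp(w, D_train, u, damping), g_val)
    influence = -g_train @ v    # > 0: up-weighting the pair raises validation loss
    grad_sim = g_train @ g_val  # > 0: a gradient step on the pair lowers it (first order)
    return influence, grad_sim

def prune_most_harmful(influence, frac=0.10):
    """Return sorted indices of the pairs kept after dropping the top `frac`."""
    k = int(len(influence) * frac)
    order = np.argsort(-influence)          # most harmful first
    return np.sort(order[k:])

# Synthetic demo (assumed setup): 200 pairs, 20 of them with flipped labels.
rng = np.random.default_rng(0)
w_true = rng.normal(size=4)
D_train = rng.normal(size=(200, 4))
D_train[D_train @ w_true < 0] *= -1.0       # make every pair agree with w_true
D_train[:20] *= -1.0                        # then flip 20 pairs to simulate noise
D_val = rng.normal(size=(50, 4))
D_val[D_val @ w_true < 0] *= -1.0

w = np.zeros(4)
for _ in range(500):                        # plain gradient descent fit
    w -= 0.5 * pair_grads(w, D_train).mean(axis=0)

influence, grad_sim = score_examples(w, D_train, D_val)
kept = prune_most_harmful(influence, frac=0.10)
```

The damping term keeps the Hessian positive definite so that conjugate gradient is well defined. Ranking by `grad_sim` instead of `influence` gives the curvature-free baseline that the abstract compares against.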
