Influence Functions for Preference Dataset Pruning

Daniel Fein; Gabriela Aranguiz-Dias

arXiv:2507.14344·cs.LG·July 22, 2025

TL;DR

This paper explores the use of influence functions to identify and prune harmful training examples in preference datasets, improving fine-tuned language model performance.

Contribution

The paper adapts conjugate-gradient-approximated influence functions to pruning preference datasets used for reward model training, and reports a small accuracy gain when the model is retrained on the filtered data.

Findings

01

Influence function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples.

02

Gradient similarity outperforms influence functions in detecting helpful examples.

03

Local curvature appears to matter more for identifying harmful training examples than for identifying helpful ones.
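
For readers unfamiliar with the two scores being compared, the block below writes out the textbook forms. It is a sketch under assumptions: the symbols (the fitted parameters \hat\theta, the per-example loss L, the damping term \lambda) are standard notation rather than notation taken from the paper, and the paper's exact similarity measure and sign convention may differ.

```latex
% Influence of up-weighting training example z_i on the validation loss
% (positive values predict that z_i raises validation loss, i.e. is harmful):
\mathcal{I}(z_i, z_{\mathrm{val}})
  = -\,\nabla_\theta L(z_{\mathrm{val}}, \hat\theta)^{\top}
     \left(H_{\hat\theta} + \lambda I\right)^{-1}
     \nabla_\theta L(z_i, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{j=1}^{n} \nabla_\theta^{2} L(z_j, \hat\theta).

% Gradient similarity drops the inverse Hessian, so no curvature information is used
% (positive values suggest that z_i is helpful):
\mathcal{S}(z_i, z_{\mathrm{val}})
  = \nabla_\theta L(z_{\mathrm{val}}, \hat\theta)^{\top}
    \nabla_\theta L(z_i, \hat\theta).
```

The only difference between the two scores is the inverse-Hessian factor, which is where local curvature enters.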

Abstract

Language models are commonly fine-tuned via reinforcement learning to alter their behavior or elicit new capabilities. Datasets used for these purposes, and particularly human preference datasets, are often noisy. The relatively small size of post-training datasets, combined with parameter-efficient fine-tuning methods, enables the use of influence function approximations to detect and prune training examples that are harmful to performance on a validation set. In this work, we adapt the TL;DR dataset for reward model training to demonstrate how conjugate-gradient-approximated influence functions can be used to filter datasets. In our experiments, influence function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples. We also show that gradient similarity outperforms influence functions for detecting helpful training examples. This suggests…
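
To make the filtering procedure in the abstract concrete, here is a minimal NumPy sketch of influence-based pruning with a conjugate-gradient solve. It is not the authors' implementation. It assumes a toy linear reward model trained with a Bradley-Terry (logistic) loss on feature differences, and every name in it (`make_toy_pairs`, `fit`, `influence_scores`, `gradient_similarity_scores`, `prune`, the damping value) is hypothetical and chosen for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def make_toy_pairs(n, w_true, flip_frac, rng):
    """Toy preference pairs: D holds feature differences, y the preference label.
    A fraction flip_frac of labels is flipped to mimic annotation noise."""
    D = rng.normal(size=(n, len(w_true)))
    y = (D @ w_true > 0).astype(float)
    flipped = rng.random(n) < flip_frac
    y[flipped] = 1.0 - y[flipped]
    return D, y, flipped

def per_example_grads(w, D, y):
    """Gradient of the logistic loss for each pair, shape (n, p)."""
    return (sigmoid(D @ w) - y)[:, None] * D

def hvp(w, D, v, damping):
    """Product of (mean-loss Hessian + damping * I) with v, without forming the Hessian."""
    s = sigmoid(D @ w)
    return D.T @ (s * (1.0 - s) * (D @ v)) / len(D) + damping * v

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve A x = b for a symmetric positive definite A given only x -> A x."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def fit(D, y, damping, steps=500, lr=0.5):
    """Plain gradient descent on the damped (L2-regularized) mean loss."""
    w = np.zeros(D.shape[1])
    for _ in range(steps):
        w -= lr * (per_example_grads(w, D, y).mean(axis=0) + damping * w)
    return w

def influence_scores(w, D_tr, y_tr, D_val, y_val, damping=1e-2):
    """score_i = -g_val^T H^{-1} g_i; larger means more harmful to validation loss."""
    g_val = per_example_grads(w, D_val, y_val).mean(axis=0)
    s_val = conjugate_gradient(lambda v: hvp(w, D_tr, v, damping), g_val)
    return -per_example_grads(w, D_tr, y_tr) @ s_val

def gradient_similarity_scores(w, D_tr, y_tr, D_val, y_val):
    """Curvature-free baseline: dot product g_i . g_val; larger suggests more helpful.
    Assumption: a plain dot product stands in for the paper's similarity measure."""
    g_val = per_example_grads(w, D_val, y_val).mean(axis=0)
    return per_example_grads(w, D_tr, y_tr) @ g_val

def prune(scores, frac=0.10):
    """Indices kept after dropping the frac of examples with the highest harm score."""
    n_drop = int(round(frac * len(scores)))
    order = np.argsort(scores)  # ascending: least harmful first
    return np.sort(order[: len(scores) - n_drop])
```

A typical use would be to fit on the noisy training pairs, score them against a clean validation set, call `prune(scores, 0.10)`, and retrain on the kept indices. With a parameter-efficient fine-tuning setup the same pattern would apply to the trainable parameters only, with the Hessian-vector product supplied by automatic differentiation rather than the closed form used here.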
