Data Valuation for LLM Fine-Tuning: Efficient Shapley Value Approximation via Language Model Arithmetic
Mélissa Tamine, Otmane Sakhi, Benjamin Heymann

TL;DR
This paper introduces a scalable method for data valuation in large language models by leveraging the mathematical structure of Direct Preference Optimization, enabling efficient Shapley value approximation for better data curation and collaboration.
Contribution
The paper presents a novel approach that simplifies Shapley value computation for LLMs trained with DPO, facilitating practical data valuation.
Findings
Shapley values can be efficiently approximated for DPO-trained LLMs.
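The method significantly reduces computational cost compared to traditional approaches, which evaluate a separately trained model for each coalition of data owners.
The resulting valuations support fair data sharing and informed investment decisions among multiple data owners.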
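To make the cost issue concrete, here is a minimal sketch of exact Shapley value computation over a handful of data owners. It is not the paper's method: the owner names, dataset sizes, and the `toy_utility` function are hypothetical stand-ins for "performance of a model trained on this coalition's data". In a real setting each utility call would be a training run, and the exact computation needs one call per coalition, i.e. 2^n of them for n owners.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values; evaluates utility on every coalition (2^n calls)."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def toy_utility(coalition):
    # Hypothetical utility: diminishing returns in pooled dataset size.
    # In practice this would be the evaluation score of a trained model.
    sizes = {"A": 100, "B": 300, "C": 600}
    return sum(sizes[o] for o in coalition) ** 0.5


owners = ["A", "B", "C"]
phi = shapley_values(owners, toy_utility)
for owner in owners:
    print(owner, round(phi[owner], 3))
```

The values sum to the utility of the full coalition (the efficiency property), which is what makes the Shapley value a natural rule for splitting benefits among contributors. The exponential number of utility evaluations is the bottleneck that approximation methods aim to avoid.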
Abstract
Data is a critical asset for training large language models (LLMs), alongside compute resources and skilled workers. While some training data is publicly available, substantial investment is required to generate proprietary datasets, such as human preference annotations, or to curate new ones from existing sources. As larger datasets generally yield better model performance, two natural questions arise. First, how can data owners make informed decisions about curation strategies and investment in data sources? Second, how can multiple data owners collaboratively pool their resources to train superior models while fairly distributing the benefits? This problem, known as data valuation, is not specific to large language models; the machine learning community has addressed it through the lens of cooperative game theory, with the Shapley value being the prevalent solution concept. However,…
Taxonomy
Topics: Ethics and Social Impacts of AI · Topic Modeling · Explainable Artificial Intelligence (XAI)
