Data Valuation for LLM Fine-Tuning: Efficient Shapley Value Approximation via Language Model Arithmetic

Mélissa Tamine; Otmane Sakhi; Benjamin Heymann

arXiv:2512.15765 · cs.LG · January 27, 2026

Open Access

TL;DR

This paper introduces a scalable method for data valuation in large language models by leveraging the mathematical structure of Direct Preference Optimization, enabling efficient Shapley value approximation for better data curation and collaboration.

Contribution

The paper presents a novel approach that simplifies Shapley value computation for LLMs trained with DPO, facilitating practical data valuation.

Findings

01 Shapley values can be efficiently approximated for DPO-trained LLMs.

02 The method significantly reduces computational cost compared to traditional approaches.

03 The method enables fair data sharing and informed investment decisions among multiple data owners.

Abstract

Data is a critical asset for training large language models (LLMs), alongside compute resources and skilled workers. While some training data is publicly available, substantial investment is required to generate proprietary datasets, such as human preference annotations, or to curate new ones from existing sources. As larger datasets generally yield better model performance, two natural questions arise. First, how can data owners make informed decisions about curation strategies and investment in data sources? Second, how can multiple data owners collaboratively pool their resources to train superior models while fairly distributing the benefits? This problem, known as data valuation, is not specific to large language models; the machine learning community has addressed it through the lens of cooperative game theory, with the Shapley value being the prevalent solution concept. However,…
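To make the Shapley value mentioned in the abstract concrete, the following is a minimal sketch of the exact computation for a toy two-owner game. It is not the paper's method: the function names (`exact_shapley`, `toy_utility`) and the utility numbers in `TOY_SCORES` are hypothetical, chosen only for illustration. Each data owner's value is its marginal contribution to the utility, averaged over every order in which the owners could join the coalition.

```python
from itertools import permutations
from math import factorial


def exact_shapley(players, utility):
    """Average each player's marginal contribution over all join orders."""
    values = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal gain from adding p to the players already present.
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in values.items()}


# Hypothetical utilities: the score of a model trained on each coalition's data.
# These numbers are made up for illustration and do not come from the paper.
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.4,
    frozenset({"B"}): 0.2,
    frozenset({"A", "B"}): 0.7,
}


def toy_utility(coalition):
    return TOY_SCORES[frozenset(coalition)]


shapley = exact_shapley(["A", "B"], toy_utility)
print(shapley)
```

With these toy numbers, owner A receives about 0.45 and owner B about 0.25, which together equal the full-coalition utility of 0.7. The exact computation enumerates every ordering, and in a realistic setting each utility evaluation would presumably require training a model on that coalition's data, which is why approximation matters at scale.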

Taxonomy

Topics: Ethics and Social Impacts of AI · Topic Modeling · Explainable Artificial Intelligence (XAI)