Designing DSIC Mechanisms for Data Sharing in the Era of Large Language Models
Seyed Moein Ayyoubzadeh, Kourosh Shahnazari, Mohammadali Keshtparvar, MohammadAmin Fazli

TL;DR
This paper develops a mechanism-design framework that incentivizes data providers to share high-quality data truthfully for training large language models, while ensuring fairness, budget balance, and robustness against misreporting.
Contribution
It introduces the Q-MIA and MUT mechanisms that incentivize truthful data reporting, balance budgets, and incorporate future utility, advancing data market design for LLM training.
Findings
The proposed mechanisms outperform volume-based baselines in data quality and cost efficiency.
The proposed mechanisms are robust to misreporting and collusion.
The framework supports privacy-preserving, verifiable implementation.
Abstract
Training large language models (LLMs) requires vast amounts of high-quality data from institutions that face legal, privacy, and strategic constraints. Existing data procurement methods often rely on unverifiable trust or ignore heterogeneous provider costs. We introduce a mechanism-design framework for truthful, trust-minimized data sharing that ensures dominant-strategy incentive compatibility (DSIC), individual rationality, and weak budget balance, while rewarding data based on both quality and learning utility. We formalize a model where providers privately know their data cost and quality, and value arises solely from the data's contribution to model performance. Based on this, we propose the Quality-Weighted Marginal-Incentive Auction (Q-MIA), which ranks providers using a virtual cost metric and uses Myerson-style payments to ensure DSIC and budget feasibility. To support…
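The ranking-and-payment idea behind Q-MIA can be illustrated with a minimal sketch. The abstract does not give the paper's virtual cost formula, so this example assumes a virtual cost of reported cost divided by quality, assumes quality scores are verifiable, and selects a fixed number `k` of winners. The names `Bid` and `run_auction` are hypothetical. Each winner receives a critical (threshold) payment in the Myerson style: the highest cost it could have reported and still been selected. The sketch does not model the budget constraint or the learning-utility component.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    provider: str
    cost: float     # reported cost (private information in the model)
    quality: float  # quality score; assumed verifiable and > 0 in this sketch


def run_auction(bids, k):
    """Pick the k bids with the lowest assumed virtual cost (cost / quality)
    and pay each winner its critical value."""
    if not 0 < k < len(bids):
        raise ValueError("k must be positive and smaller than the number of bids")
    ranked = sorted(bids, key=lambda b: b.cost / b.quality)
    winners = ranked[:k]
    # Virtual cost of the best losing bid: a winner stays selected as long as
    # its own virtual cost remains below this threshold.
    threshold = ranked[k].cost / ranked[k].quality
    # Critical payment: the largest cost report that would still win.
    return {b.provider: b.quality * threshold for b in winners}


bids = [
    Bid("A", cost=10.0, quality=2.0),  # virtual cost 5
    Bid("B", cost=6.0, quality=1.0),   # virtual cost 6
    Bid("C", cost=12.0, quality=3.0),  # virtual cost 4
    Bid("D", cost=8.0, quality=1.0),   # virtual cost 8
]
payments = run_auction(bids, k=2)
```

In this sketch a winner's payment depends only on the other bids and its own quality, never on its own reported cost, which is the property that makes truthful cost reporting a dominant strategy. Each winner is also paid at least its reported cost, so participation is individually rational.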
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Mobile Crowdsensing and Crowdsourcing · Big Data and Digital Economy
