Conformal Data Contamination Tests for Trading or Sharing of Data
Martin V. Vejling, Shashi Raj Pandey, Christophe A. N. Biscio, Petar Popovski

TL;DR
This paper introduces a distribution-free conformal testing framework that identifies contaminated external data, giving data buyers quality guarantees before data is traded or shared for machine learning.
Contribution
It proposes novel conformal data contamination tests that provide rigorous, distribution-free quality guarantees for external data in collaborative learning scenarios.
Findings
Tests remain valid under arbitrary contamination levels
Effective in diverse collaborative learning scenarios
Enables false discovery rate control
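As a rough illustration of the machinery these findings rest on, the sketch below pairs standard conformal p-values for outlier detection with the Benjamini-Hochberg procedure for false discovery rate control. It is not the paper's two-sample test: it flags individual points rather than testing an agent's contamination level, and the function names, the "higher score means more atypical" convention, and the example values are all assumptions made for illustration.

```python
from bisect import bisect_left


def conformal_p_values(cal_scores, test_scores):
    """Conformal p-value of each test score against clean calibration scores.

    Assumes higher scores mean "more atypical" (hypothetical convention).
    p = (1 + #{calibration scores >= test score}) / (n + 1)
    """
    cal = sorted(cal_scores)
    n = len(cal)
    # bisect_left gives the count of calibration scores strictly below s,
    # so n - bisect_left(...) is the count of scores >= s.
    return [(1 + n - bisect_left(cal, s)) / (n + 1) for s in test_scores]


def benjamini_hochberg(p_values, alpha=0.1):
    """Return a rejection flag per p-value using the Benjamini-Hochberg rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


# Illustrative values only: scores from a buyer's clean local data and from
# an external agent's data (both hypothetical).
calibration = [0.1, 0.3, 0.2, 0.4, 0.25, 0.35, 0.15, 0.05, 0.45]
external = [0.2, 0.9, 0.3, 1.2]
p_vals = conformal_p_values(calibration, external)
flags = benjamini_hochberg(p_vals, alpha=0.2)
```

The smallest achievable p-value here is 1 / (n + 1), so the calibration set has to be large enough for any rejection to be possible at the chosen level.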
Abstract
The amount of quality data in many machine learning tasks is limited to what is available locally to data owners. The set of quality data can be expanded through trading or sharing with external data agents. However, data buyers need quality guarantees before purchasing, as external data may be contaminated or irrelevant to their specific learning task. Previous works primarily rely on distributional assumptions about data from different agents, relegating quality checks to post-hoc steps involving costly data valuation procedures. We propose a distribution-free, contamination-aware data-sharing framework that identifies external data agents whose data is most valuable for model personalization. To achieve this, we introduce novel two-sample testing procedures, grounded in rigorous theoretical foundations for conformal outlier detection, to determine whether an agent's data exceeds a…
Taxonomy
Topics: Data Quality and Management · Imbalanced Data Classification Techniques · Ethics and Social Impacts of AI
