Conformal Data Contamination Tests for Trading or Sharing of Data

Martin V. Vejling; Shashi Raj Pandey; Christophe A. N. Biscio; Petar Popovski

arXiv:2507.13835 · stat.ML · July 21, 2025

TL;DR

This paper introduces a distribution-free, conformal testing framework to identify contaminated external data, ensuring quality guarantees for data sharing in machine learning without relying on distributional assumptions.

Contribution

It proposes novel conformal data contamination tests that provide rigorous, distribution-free quality guarantees for external data in collaborative learning scenarios.

Findings

01. Tests remain valid under arbitrary contamination levels
02. Effective in diverse collaborative learning scenarios
03. Enables false discovery rate control

Abstract

The amount of quality data in many machine learning tasks is limited to what is available locally to data owners. The set of quality data can be expanded through trading or sharing with external data agents. However, data buyers need quality guarantees before purchasing, as external data may be contaminated or irrelevant to their specific learning task. Previous works primarily rely on distributional assumptions about data from different agents, relegating quality checks to post-hoc steps involving costly data valuation procedures. We propose a distribution-free, contamination-aware data-sharing framework that identifies external data agents whose data is most valuable for model personalization. To achieve this, we introduce novel two-sample testing procedures, grounded in rigorous theoretical foundations for conformal outlier detection, to determine whether an agent's data exceeds a…
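To make the ingredients named in the abstract concrete, here is a minimal sketch of generic conformal outlier detection with false discovery rate control. It is not the authors' two-sample test: it flags individual points rather than deciding whether an agent's data exceeds a contamination level. The nonconformity score (absolute distance from a reference mean), the synthetic Gaussian data, and the function names `conformal_p_values` and `benjamini_hochberg` are all assumptions made for illustration.

```python
import numpy as np


def conformal_p_values(calib_scores, test_scores):
    """Conformal p-values; a larger score means a more atypical point."""
    calib = np.sort(np.asarray(calib_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = calib.size
    # Count calibration scores that are >= each test score.
    n_ge = n - np.searchsorted(calib, test, side="left")
    return (1.0 + n_ge) / (n + 1.0)


def benjamini_hochberg(p_values, alpha=0.1):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected


# Synthetic illustration (assumed data, not from the paper).
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=500)    # clean local data used to fit the score
calibration = rng.normal(0.0, 1.0, size=500)  # held-out clean local data
center = reference.mean()


def score(x):
    # Hypothetical nonconformity score: distance from the reference mean.
    return np.abs(np.asarray(x, dtype=float) - center)


# External batch: 20 clean points followed by 5 contaminated points.
external = np.concatenate([rng.normal(0.0, 1.0, size=20), np.full(5, 6.0)])
p_vals = conformal_p_values(score(calibration), score(external))
flagged = benjamini_hochberg(p_vals, alpha=0.1)
print("flagged indices:", np.nonzero(flagged)[0])
```

The p-values are valid without distributional assumptions as long as the calibration points and the clean external points are exchangeable, which is the property that the distribution-free claims rest on. Applying Benjamini-Hochberg to these p-values is a standard way to target false discovery rate control in this generic setting.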

Taxonomy

Topics: Data Quality and Management · Imbalanced Data Classification Techniques · Ethics and Social Impacts of AI