Data Origin Inference in Machine Learning
Mingxue Xu, Xiang-Yang Li

TL;DR
This paper introduces a method for inferring the origin of training data in machine learning models. It helps developers identify missed or faulty data sources without maintaining extensive metadata, and it reports high accuracy across several data types.
Contribution
The paper presents a new data origin inference strategy that combines embedded-space multiple instance classification with shadow training. The strategy applies to diverse data types and kinds of data origin, and the paper includes a comprehensive performance analysis.
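To make the embedded-space idea concrete, here is a minimal sketch, not the authors' implementation. It treats all samples from one candidate origin as a bag, maps the bag to a fixed-length vector, and classifies that vector. Several parts are assumptions made for illustration: the per-instance signal is a model confidence score, the bag embedding is a set of simple pooling statistics, the classifier is nearest-centroid, and the labeled bags are hard-coded stand-ins for outputs that shadow models would produce. The function names `embed_bag`, `fit_centroids`, and `predict` are hypothetical.

```python
from statistics import mean, pstdev


def embed_bag(scores):
    """Map a bag of per-instance scores to one fixed-length vector."""
    return [mean(scores), min(scores), max(scores), pstdev(scores)]


def fit_centroids(bags, labels):
    """Average the bag embeddings of each class (1 = origin was in training)."""
    centroids = {}
    for label in set(labels):
        rows = [embed_bag(b) for b, y in zip(bags, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, bag):
    """Assign the bag to the class whose centroid is nearest (squared L2)."""
    emb = embed_bag(bag)

    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(emb, centroid))

    return min(centroids, key=lambda label: dist(centroids[label]))


# Assumed stand-ins for shadow-model outputs: one bag of confidence
# scores per origin, labeled by whether that origin was in the shadow
# model's training set. Real shadow training is not implemented here.
shadow_bags = [
    [0.97, 0.97, 0.99, 0.95],  # origin included in shadow training
    [0.92, 0.98, 0.96, 0.94],  # origin included in shadow training
    [0.61, 0.72, 0.55, 0.80],  # origin held out
    [0.66, 0.58, 0.75, 0.70],  # origin held out
]
shadow_labels = [1, 1, 0, 0]

centroids = fit_centroids(shadow_bags, shadow_labels)

# Bag of scores the target model gives to samples from a queried origin.
target_bag = [0.93, 0.97, 0.95]
print(predict(centroids, target_bag))  # 1 means "inferred to be in training"
```

Working at the bag level rather than per sample means the decision uses the distribution of signals across an origin's samples, which is the point of the embedded-space formulation. Any standard classifier could replace the nearest-centroid rule used here.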
Findings
Achieves 98.96% accuracy in the language use case with transformer models
Effective across language, visual, and structured data types
Provides statistical insights into data origin inference success rates
Abstract
A growing line of work uses unintended memorization in ML models to benefit real-world applications, with recent efforts such as user auditing, dataset ownership inference, and forgotten data measurement. From the standpoint of ML model development, we introduce a process named data origin inference, which assists ML developers in locating missed or faulty data origins in the training set without maintaining burdensome metadata. We formally define the data origin and the data origin inference task in the development of ML models (mainly neural networks). We then propose a novel inference strategy combining embedded-space multiple instance classification and shadow training. Diverse use cases cover language, visual, and structured data, with various kinds of data origin (e.g., business, county, movie, mobile user, text author). A comprehensive performance analysis of our proposed strategy…
Taxonomy
Topics: Topic Modeling · Advanced Data Storage Technologies · Data Quality and Management
