Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies
Qi Liu, Wanjing Ma

TL;DR
This paper examines how data corruption affects machine learning models, analyzing the impact of missing and noisy data, and evaluates strategies like imputation and dataset enlargement to improve robustness in real-world scenarios.
Contribution
It introduces a comprehensive analysis of data corruption effects, models the performance decline, and provides practical guidelines for data preprocessing and augmentation strategies.
Findings
Model performance under data corruption follows a diminishing-returns curve that can be modeled by an exponential function.
Noisy data causes more severe degradation than missing data.
Increasing dataset size has diminishing returns in mitigating corruption effects.
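As a rough illustration of the first finding, the sketch below fits a saturating exponential to (data quality, score) pairs. The functional form `a * (1 - exp(-b * quality))`, the function names, and the synthetic data are assumptions made for this example; the summary does not state the paper's exact parameterization.

```python
import math

def saturating_curve(quality, a, b):
    """Assumed diminishing-returns form: a * (1 - exp(-b * quality)).

    `quality` is the uncorrupted fraction of the data (1 - corruption ratio).
    """
    return a * (1.0 - math.exp(-b * quality))

def fit_saturating_curve(qualities, scores, b_grid=None):
    """Least-squares fit: grid search over b, closed-form a for each b."""
    if b_grid is None:
        b_grid = [0.05 * k for k in range(1, 401)]  # b in (0, 20]
    best = None
    for b in b_grid:
        basis = [1.0 - math.exp(-b * q) for q in qualities]
        denom = sum(f * f for f in basis)
        if denom == 0.0:
            continue
        # For fixed b the model is linear in a, so a has a closed form.
        a = sum(f * y for f, y in zip(basis, scores)) / denom
        sse = sum((a * f - y) ** 2 for f, y in zip(basis, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b)
    _, a, b = best
    return a, b

# Synthetic, noise-free scores generated from the assumed form (not paper data).
qualities = [0.1 * k for k in range(1, 11)]
scores = [saturating_curve(q, 0.9, 3.0) for q in qualities]
a_hat, b_hat = fit_saturating_curve(qualities, scores)

# Diminishing returns: the same quality gain helps less at high quality.
gain_low = saturating_curve(0.2, a_hat, b_hat) - saturating_curve(0.1, a_hat, b_hat)
gain_high = saturating_curve(1.0, a_hat, b_hat) - saturating_curve(0.9, a_hat, b_hat)
```

With real measurements, `qualities` and `scores` would come from training runs at several corruption levels, and a grid search like this one is only a simple stand-in for a proper nonlinear least-squares routine.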
Abstract
Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects through two experimental setups: supervised learning with NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal optimization (Signal-RL). We analyze the relationship between data corruption levels and model performance, evaluate the effectiveness of data imputation methods, and assess the utility of enlarging datasets to address data corruption. Our results show that model performance under data corruption follows a diminishing-returns curve, modeled by an exponential function. Missing data, while detrimental, is less harmful than noisy data, which causes severe performance degradation and training instability, particularly in…
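The two corruption types and the imputation step named in the abstract can be sketched on plain numeric feature rows. This is a minimal illustration under stated assumptions: the helper names, the per-value corruption probability, the Gaussian noise model, and column-mean imputation are choices made here, not the procedures used in the paper's NLP-SL or Signal-RL setups.

```python
import random

def corrupt_missing(rows, ratio, rng):
    """Replace each value with None with probability `ratio` (assumed scheme)."""
    return [[None if rng.random() < ratio else v for v in row] for row in rows]

def corrupt_noisy(rows, ratio, rng, scale=1.0):
    """Add zero-mean Gaussian noise to each value with probability `ratio`."""
    return [
        [v + rng.gauss(0.0, scale) if rng.random() < ratio else v for v in row]
        for row in rows
    ]

def impute_column_mean(rows):
    """Fill None with the mean of the observed values in the same column.

    Falls back to 0.0 for a column with no observed values.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Toy data: 200 rows, 3 numeric features (illustrative only).
rng = random.Random(0)
clean = [[float(i + j) for j in range(3)] for i in range(200)]
missing = corrupt_missing(clean, 0.3, rng)
noisy = corrupt_noisy(clean, 0.3, rng)
imputed = impute_column_mean(missing)
```

Training the same model on `imputed` and on `noisy` at matching ratios is one way to compare the two corruption types. Note that missing entries are detectable and can be imputed, while noisy entries carry no marker that distinguishes them from clean values.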
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning
