Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies

Qi Liu; Wanjing Ma

arXiv:2412.18296 · cs.LG · May 22, 2025

Open Access · 1 Repo

TL;DR

This paper examines how data corruption affects machine learning models, analyzing the impact of missing and noisy data, and evaluates strategies like imputation and dataset enlargement to improve robustness in real-world scenarios.

Contribution

The paper presents a comprehensive analysis of the effects of data corruption, models the resulting performance decline, and offers practical guidelines for data preprocessing and augmentation.

Findings

01

Model performance under data corruption follows a diminishing-return curve that is modeled by an exponential function.

02

Noisy data causes more severe degradation than missing data.

03

Increasing dataset size has diminishing returns in mitigating corruption effects.
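
The first finding can be made concrete with a small curve-fitting sketch. It assumes a saturating exponential of the form score = a * (1 - exp(-b * q)), where q is a hypothetical data-quality value (1 minus the corruption ratio). This parameterization, the function names, and the synthetic points are illustrative assumptions; they are not taken from the paper or its repository.

```python
import math

def saturating_exp(quality, a, b):
    """Assumed diminishing-return form: score = a * (1 - exp(-b * quality))."""
    return a * (1.0 - math.exp(-b * quality))

def fit_saturating_exp(qualities, scores, b_grid=None):
    """Least-squares fit: grid-search b, solve for a in closed form at each b."""
    if b_grid is None:
        b_grid = [0.05 * k for k in range(1, 401)]  # candidate b values in (0, 20]
    best = None
    for b in b_grid:
        basis = [1.0 - math.exp(-b * q) for q in qualities]
        denom = sum(g * g for g in basis)
        if denom == 0.0:
            continue
        # For fixed b the model is linear in a, so a has a closed-form optimum.
        a = sum(g * s for g, s in zip(basis, scores)) / denom
        sse = sum((a * g - s) ** 2 for g, s in zip(basis, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b)
    if best is None:
        raise ValueError("no usable b value in grid")
    _, a, b = best
    return a, b

# Synthetic illustration only: these points are generated, not measured results.
qualities = [k / 10 for k in range(11)]  # q = 1 - corruption ratio
true_a, true_b = 0.9, 3.0                # hypothetical curve parameters
scores = [saturating_exp(q, true_a, true_b) for q in qualities]

a_hat, b_hat = fit_saturating_exp(qualities, scores)
print(f"fitted a={a_hat:.3f}, b={b_hat:.3f}")

# Diminishing returns: the gain from q=0.0 to 0.1 exceeds the gain from 0.9 to 1.0.
gain_low = scores[1] - scores[0]
gain_high = scores[10] - scores[9]
```

Under this assumed form, each additional unit of data quality buys less improvement than the previous one, which is the sense in which the curve shows diminishing returns.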

Abstract

Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects through two experimental setups: supervised learning with NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal optimization (Signal-RL). We analyze the relationship between data corruption levels and model performance, evaluate the effectiveness of data imputation methods, and assess the utility of enlarging datasets to address data corruption. Our results show that model performance under data corruption follows a diminishing return curve, modeled by the exponential function. Missing data, while detrimental, is less harmful than noisy data, which causes severe performance degradation and training instability, particularly in…
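
The two corruption types named in the abstract, and the idea of imputing missing values, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names, the choice of Gaussian noise, and mean imputation are stand-ins, and the paper's actual corruption procedures and imputation methods may differ.

```python
import random

def _pick_indices(n, ratio, rng):
    """Choose round(ratio * n) distinct positions to corrupt."""
    return set(rng.sample(range(n), round(ratio * n)))

def corrupt_missing(values, ratio, rng):
    """Missing-data corruption: replace a random share of entries with None."""
    idx = _pick_indices(len(values), ratio, rng)
    return [None if i in idx else v for i, v in enumerate(values)]

def corrupt_noisy(values, ratio, rng, sigma=1.0):
    """Noisy-data corruption (assumed Gaussian): perturb a random share of entries."""
    idx = _pick_indices(len(values), ratio, rng)
    return [v + rng.gauss(0.0, sigma) if i in idx else v
            for i, v in enumerate(values)]

def impute_mean(values):
    """Mean imputation: fill each missing entry with the mean of observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

rng = random.Random(0)                  # fixed seed for reproducibility
clean = [float(i) for i in range(20)]   # toy feature column

missing = corrupt_missing(clean, 0.25, rng)
noisy = corrupt_noisy(clean, 0.25, rng)
imputed = impute_mean(missing)

print("missing entries:", missing.count(None))
print("changed by noise:", sum(a != b for a, b in zip(clean, noisy)))
```

The sketch also hints at why noise can be harder to handle than missingness: a missing entry is explicitly marked and can be imputed, whereas a noisy entry looks like valid data and cannot be located without extra information.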

Code & Models

Repositories

qiliuchn/data-corruption-study (PyTorch, Official)

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning