Detecting Mislabeled and Corrupted Data via Pointwise Mutual Information

Jinghan Yang; Jiayu Weng

arXiv:2508.07713·cs.LG·August 12, 2025

Open Access

TL;DR

This paper introduces a mutual information-based method to identify and filter out mislabeled and corrupted data in neural network training, improving model accuracy and robustness.

Contribution

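It presents a mutual information framework that scores data quality by each sample's pointwise contribution to the overall mutual information between inputs and labels, and uses those scores to filter noisy samples in hybrid noise scenarios (label noise combined with input noise).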

Findings

01 Under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling.

02 Low pointwise MI contributions flag mislabeled and corrupted samples.

03 The method is robust to benign input modifications, preserving semantically valid data.

Abstract

Deep neural networks can memorize corrupted labels, making data quality critical for model performance, yet real-world datasets are frequently compromised by both label noise and input noise. This paper proposes a mutual information-based framework for data selection under hybrid noise scenarios that quantifies statistical dependencies between inputs and labels. We compute each sample's pointwise contribution to the overall mutual information and find that lower contributions indicate noisy or mislabeled instances. Empirical validation on MNIST with different synthetic noise settings demonstrates that the method effectively filters low-quality samples. Under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling. Furthermore, the method exhibits robustness to benign input modifications, preserving semantically valid data…
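
The idea of a per-sample contribution to mutual information can be illustrated with a toy example. The abstract does not say which estimator the authors use, so the sketch below is an assumption: it swaps in a simple plug-in estimate from empirical counts, PMI(x, y) = log p(x, y) / (p(x) p(y)), which only works for discrete inputs. The function name `pointwise_mi`, the synthetic feature `x`, and the 20% flip rate are all hypothetical choices made for illustration, not details taken from the paper.

```python
import numpy as np

def pointwise_mi(x, y):
    """Per-sample PMI, log p(x,y) / (p(x) p(y)), from empirical counts.

    Assumes x and y are 1-D arrays of discrete values. This is a plug-in
    count estimator used for illustration, not the paper's estimator.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    # Map arbitrary discrete values to integer indices 0..K-1.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    # Empirical joint distribution over (x, y) pairs.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1)  # marginal over x
    py = joint.sum(axis=0)  # marginal over y
    # Every observed pair has nonzero joint mass, so the log is finite.
    return np.log(joint[xi, yi] / (px[xi] * py[yi]))

# Toy data (hypothetical): a discrete feature that matches the clean label,
# with 20% of the labels flipped to a different class.
rng = np.random.default_rng(0)
n = 5000
y_true = rng.integers(0, 10, size=n)
x = y_true.copy()
flipped = rng.random(n) < 0.2
y_noisy = np.where(flipped, (y_true + rng.integers(1, 10, size=n)) % 10, y_true)

pmi = pointwise_mi(x, y_noisy)
mi_estimate = pmi.mean()  # the sample average of PMI estimates overall MI

# Keep the samples with the highest pointwise contribution (top 80% here).
threshold = np.quantile(pmi, 0.2)
keep = pmi > threshold
print("MI estimate (nats):", mi_estimate)
print("fraction of flipped labels among kept samples:", flipped[keep].mean())
```

In this toy setting the flipped samples fall on rare (x, y) pairs and so receive low PMI, which is the behavior the abstract describes. Image inputs such as MNIST are not discrete in this sense, so a count-based estimate like this one would not apply directly there.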

Taxonomy

Topics: Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI) · Text and Document Classification Technologies