OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness

Amanda S Barnard

arXiv:2605.11525 · cs.LG · May 13, 2026

TL;DR

OverNaN introduces a NaN-aware oversampling method that preserves meaningful missingness in imbalanced datasets, so minority classes can be augmented without discarding the information carried by missing values.

Contribution

OverNaN extends synthetic oversampling techniques to operate directly on incomplete data, maintaining the missingness structure rather than imputing or deleting it.

Findings

01 OverNaN effectively retains missingness during oversampling.

02 It improves class balance without distorting missing data patterns.

03 The method is suitable for small, incomplete datasets in scientific domains.

Abstract

Missing values are routinely treated as defects to be eliminated through deletion or imputation prior to machine learning. In many applied domains, however, missingness itself carries information, reflecting experimental constraints, measurement choices, or systematic mechanisms tied to the data-generating process. Eliminating or masking this structure can distort class boundaries, introduce bias, and reduce generalisability, particularly in imbalanced datasets where minority classes are already under-represented. OverNaN is a lightweight, NaN-aware oversampling framework designed to address class imbalance without erasing missingness structure. It extends common synthetic oversampling methods to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated according to explicitly defined strategies. Rather than…
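
To make the idea of oversampling directly on incomplete feature vectors concrete, here is a minimal Python sketch of SMOTE-style interpolation that tolerates NaN entries. It is an illustration under stated assumptions, not the OverNaN API: the function names, the strategy names ("propagate" and "fill_from_observed"), and the random choice of partner sample (rather than a nearest-neighbour search) are all hypothetical and chosen for brevity.

```python
import numpy as np


def nan_aware_interpolate(x_a, x_b, lam, strategy="propagate"):
    """Interpolate between two samples that may contain NaN.

    Hypothetical strategies (assumed names, not taken from the paper):
      "propagate":          output is NaN wherever either parent is NaN.
      "fill_from_observed": where exactly one parent is observed, copy its
                            value; output is NaN only where both are NaN.
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    nan_a, nan_b = np.isnan(x_a), np.isnan(x_b)

    # Start fully missing, then fill in what the strategy allows.
    out = np.full(x_a.shape, np.nan)

    # Standard linear interpolation where both parents are observed.
    both = ~nan_a & ~nan_b
    out[both] = x_a[both] + lam * (x_b[both] - x_a[both])

    if strategy == "fill_from_observed":
        only_a = ~nan_a & nan_b
        only_b = nan_a & ~nan_b
        out[only_a] = x_a[only_a]
        out[only_b] = x_b[only_b]
    elif strategy != "propagate":
        raise ValueError(f"unknown strategy: {strategy!r}")
    return out


def oversample_minority(X_min, n_new, strategy="propagate", seed=0):
    """Create n_new synthetic minority rows from randomly paired samples.

    Assumption: partners are drawn uniformly at random; a fuller version
    would pick them with a NaN-tolerant nearest-neighbour search.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    rows = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_min), size=2, replace=False)
        lam = rng.uniform()  # interpolation weight in [0, 1)
        rows.append(nan_aware_interpolate(X_min[i], X_min[j], lam, strategy))
    return np.vstack(rows)


if __name__ == "__main__":
    X_min = np.array([[1.0, np.nan, 3.0],
                      [3.0, 5.0, np.nan],
                      [2.0, 4.0, 6.0]])
    print(oversample_minority(X_min, n_new=4, strategy="propagate"))
```

In this sketch the synthetic rows keep NaN entries instead of being imputed, so a downstream model that handles missing values natively still sees the missingness pattern of the minority class.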
