Divide and Predict: An Architecture for Input Space Partitioning and Enhanced Accuracy
Fenix W. Huang, Henning S. Mortveit, and Christian M. Reidys

TL;DR
This paper introduces a variance-based measure of heterogeneity in training data and uses it to partition the data into blocks, which can improve supervised learning accuracy. The approach is demonstrated in proof-of-concept experiments on EMNIST image data and synthetic data.
Contribution
It presents a novel variance measure for data heterogeneity and shows how partitioning data based on this measure enhances model performance.
Findings
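The variance measure captures data heterogeneity and can indicate whether a sample is a mixture of distributions.
Variance-based partitioning followed by conventional training over the blocks can significantly increase test accuracy.
Variance is maximal for equal mixes of distributions.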
Abstract
In this article, the authors develop an intrinsic measure for quantifying heterogeneity in training data for supervised learning. This measure is the variance of a random variable which factors through the influences of pairs of training points. The variance is shown to capture data heterogeneity and can thus be used to assess whether a sample is a mixture of distributions. The authors prove that the data itself contains key information that supports a partitioning into blocks. Several proof-of-concept studies are provided that quantify the connection between variance and heterogeneity for EMNIST image data and synthetic data. The authors establish that variance is maximal for equal mixes of distributions, and detail how variance-based data purification followed by conventional training over blocks can lead to significant increases in test accuracy.
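The claim that variance is maximal for equal mixes can be illustrated with a toy analogy. The sketch below is not the paper's influence-based measure, whose definition is not given in this summary. It assumes a simple two-valued variable that equals a with probability p and b otherwise, whose variance is p(1 - p)(a - b)^2. The function name two_point_variance and the values a = 0, b = 1 are hypothetical choices made for illustration only.

```python
def two_point_variance(p, a=0.0, b=1.0):
    """Variance of a variable equal to a with probability p and b otherwise.

    Toy stand-in for a heterogeneity score; NOT the measure from the paper.
    """
    return p * (1.0 - p) * (a - b) ** 2


# Sweep the mixing proportion from 0 (pure b) to 1 (pure a) in steps of 0.1.
proportions = [i / 10 for i in range(11)]
variances = [two_point_variance(p) for p in proportions]

# The proportion with the largest variance in the sweep.
peak_proportion = max(proportions, key=two_point_variance)

for p, v in zip(proportions, variances):
    print(f"p = {p:.1f}  variance = {v:.3f}")
print("peak at p =", peak_proportion)
```

In this toy setting the variance vanishes for a pure sample (p = 0 or p = 1) and is largest at the equal mix p = 0.5, which mirrors the qualitative behaviour the abstract reports for the authors' measure.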
Taxonomy
Topics: Medical Image Segmentation Techniques · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
