From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness
Terrence J. Lee-St. John, Jordan L. Lawson, Bartlomiej Piechowski-Jozwiak

TL;DR
This paper presents a data-architectural theory explaining how high-dimensional, error-prone data can still yield robust predictive models, emphasizing the synergy between data structure and model capacity over data cleanliness.
Contribution
It introduces a novel theoretical framework combining Information Theory, Latent Factor Models, and Psychometrics to explain predictive robustness in high-D, error-prone data environments.
Findings
High-D predictor sets asymptotically overcome both noise types (Predictor Error and Structural Uncertainty)
Informative collinearity enhances model reliability
Dimensionality reduces latent inference burden
Abstract
Tabular machine learning presents a paradox: modern models achieve state-of-the-art performance using high-dimensional (high-D), collinear, error-prone data, defying the "Garbage In, Garbage Out" mantra. To help resolve this, we synthesize principles from Information Theory, Latent Factor Models, and Psychometrics, clarifying that predictive robustness arises not solely from data cleanliness, but from the synergy between data architecture and model capacity. Partitioning predictor-space "noise" into "Predictor Error" and "Structural Uncertainty" (informational deficits from stochastic generative mappings), we prove that leveraging high-D sets of error-prone predictors asymptotically overcomes both types of noise, whereas cleaning a low-D set is fundamentally bounded by Structural Uncertainty. We demonstrate why "Informative Collinearity" (dependencies from shared latent causes) enhances…
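The contrast between a cleaned low-D predictor set and an error-prone high-D one can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: a single Gaussian latent factor, predictors formed as the latent plus independent "structural" noise plus independent measurement error, and a plain average of the predictors as the latent estimate. The function name `simulate_r2` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_r2(n_predictors, error_sd, structural_sd=1.0,
                n_samples=5000, seed=0):
    """Squared correlation between the latent factor and the mean of its
    noisy predictors (a toy stand-in for how well the latent is recovered)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)  # latent cause
    # Assumed stand-in for Structural Uncertainty: noise in the mapping
    # from the latent to each predictor, present even with perfect measurement.
    structural = structural_sd * rng.standard_normal((n_samples, n_predictors))
    # Assumed stand-in for Predictor Error: measurement noise that
    # cleaning could in principle remove.
    error = error_sd * rng.standard_normal((n_samples, n_predictors))
    x = z[:, None] + structural + error
    proxy = x.mean(axis=1)  # simple average as the latent estimate
    return np.corrcoef(proxy, z)[0, 1] ** 2

# Perfectly cleaned but low-D: only structural noise remains.
clean_low_d = simulate_r2(n_predictors=2, error_sd=0.0)
# Error-prone but high-D: both noise sources average out as p grows.
noisy_high_d = simulate_r2(n_predictors=200, error_sd=2.0)

print(f"clean low-D  R^2: {clean_low_d:.3f}")
print(f"noisy high-D R^2: {noisy_high_d:.3f}")
```

In this toy setup the population value is 1 / (1 + (structural_sd² + error_sd²) / p), so setting error_sd to zero with p fixed leaves a ceiling set by the structural term, while increasing p shrinks both terms. Averaging is only one possible estimator, and independent noise is a strong simplification of the dependencies the abstract describes.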
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Generative Adversarial Networks and Image Synthesis
