Inverse Sampling of Degenerate Datasets from a Linear Regression Line
Albert S. Kim

TL;DR
This paper introduces a method for generating multiple datasets that share the statistical properties of a given linear regression, which helps in testing models and understanding data degeneracy.
Contribution
It presents a novel algorithm for inverse sampling of degenerate datasets that match the statistical characteristics of a reference dataset.
Findings
Characterized the Anscombe datasets.
Developed a general algorithm for creating degenerate datasets.
Facilitated testing of statistical models with identical data properties.
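As an illustration of what "identical data properties" means for the Anscombe datasets, the sketch below computes the usual summary statistics for the four published x-y pairs using NumPy. The helper name `summarize` and the choice of statistics are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
anscombe = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                           6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                            6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                         5.25, 12.50, 5.56, 7.91, 6.89])),
}


def summarize(x, y):
    """Return summary statistics and the least-squares line for one dataset."""
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
    return {
        "mean_x": float(np.mean(x)),
        "mean_y": float(np.mean(y)),
        "var_x": float(np.var(x, ddof=1)),  # sample variance
        "var_y": float(np.var(y, ddof=1)),
        "r": float(np.corrcoef(x, y)[0, 1]),  # Pearson correlation
        "slope": float(slope),
        "intercept": float(intercept),
    }


summaries = {name: summarize(x, y) for name, (x, y) in anscombe.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four sets give nearly the same line (slope close to 0.5, intercept close to 3) and nearly the same correlation, even though their scatter plots look very different.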
Abstract
When linear regression fits a relationship between a (dependent) scalar response and one or more independent variables, datasets with visibly different graphical trends can yield the same fitted relationship and the same summary statistics. Advanced statistical approaches, such as neural networks and machine learning methods, are needed to process, characterize, and analyze these degenerate datasets. Conversely, the accurate creation of purposely degenerate datasets is essential for testing new models in applied statistics research and education. To this end, the present study characterizes the well-known Anscombe datasets and provides a general algorithm for creating multiple paired datasets with identical statistical properties.
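The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way to build such datasets, not the paper's method. It assumes the x values are held fixed and that the target intercept, slope, and correlation are given. Random noise is projected onto the subspace orthogonal to the constant and to x, then rescaled so the residual sum of squares matches the target correlation. The function name `sample_degenerate_y` and its parameters are hypothetical.

```python
import numpy as np


def sample_degenerate_y(x, intercept, slope, r, rng):
    """Draw y so that the least-squares fit of y on x reproduces the given
    intercept and slope, with Pearson correlation of magnitude |r|.

    Assumes slope != 0 and 0 < |r| < 1; the sign of the correlation
    follows the sign of the slope.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    design = np.column_stack([np.ones(n), x])  # columns: constant, x

    # Random residuals with the part explained by (1, x) removed, so they
    # cannot shift the fitted intercept or slope.
    noise = rng.standard_normal(n)
    coef, *_ = np.linalg.lstsq(design, noise, rcond=None)
    resid = noise - design @ coef

    # r^2 = SSR / (SSR + SSE)  =>  SSE = SSR * (1 - r^2) / r^2
    ssr = slope**2 * np.sum((x - x.mean()) ** 2)
    sse = ssr * (1.0 - r**2) / r**2
    resid *= np.sqrt(sse / np.sum(resid**2))

    return intercept + slope * x + resid


# Example targets: the Anscombe x values and approximate Anscombe statistics.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
datasets = [
    sample_degenerate_y(x, intercept=3.0, slope=0.5, r=0.816,
                        rng=np.random.default_rng(seed))
    for seed in range(3)
]
for y in datasets:
    fitted_slope, fitted_intercept = np.polyfit(x, y, 1)
    print(fitted_slope, fitted_intercept, np.corrcoef(x, y)[0, 1])
```

Because x is unchanged and the residuals are orthogonal to the constant column, the mean and variance of x, the mean of y, and the variance of y are also the same across the generated datasets. Varying x as well would need a different construction.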
Taxonomy
Topics: Advanced Statistical Methods and Models · Statistical Mechanics and Entropy · Gaussian Processes and Bayesian Inference
