Inverse Sampling of Degenerate Datasets from a Linear Regression Line

Albert S. Kim

arXiv:2108.11477·stat.ME·August 27, 2021

TL;DR

This paper introduces a method for generating multiple datasets that share the statistical properties of a given linear regression, which helps in testing models and understanding data degeneracy.

Contribution

It presents a novel algorithm for inverse sampling of degenerate datasets that match the statistical characteristics of a reference dataset.

Findings

01

Characterized the Anscombe datasets.

02

Developed a general algorithm for creating degenerate datasets.

03

Facilitated testing of statistical models with identical data properties.

Abstract

When linear regression fits a relationship between a scalar (dependent) response and one or more independent variables, datasets with visibly different graphical trends can yield similar relationships with the same statistical properties. Advanced statistical approaches, such as neural networks and machine learning methods, are needed to process, characterize, and analyze these degenerate datasets. At the same time, the accurate creation of purposely degenerate datasets is essential for testing new models in applied-statistics research and education. To that end, the present study characterizes the famous Anscombe datasets and provides a general algorithm for creating multiple paired datasets with identical statistical properties.
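The idea of degenerate datasets can be illustrated with a minimal Python sketch. This is not the paper's algorithm, which the abstract does not spell out; it is one standard construction, assumed here for illustration: draw random noise, remove its components along the constant and along x, and rescale it so the fitted line and the correlation coefficient hit chosen targets. The function name `degenerate_dataset`, its parameters, and the target values (slope 0.5, intercept 3.0, r = 0.816, close to the commonly quoted Anscombe values) are assumptions made for this example.

```python
import numpy as np


def degenerate_dataset(x, slope, intercept, r, seed):
    """Return y whose least-squares line on x is exactly
    y = intercept + slope * x with |correlation| equal to |r|.

    Hypothetical helper for illustration only; requires slope != 0,
    0 < |r| <= 1, and at least three distinct x values.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)

    # Start from arbitrary noise, then project out span{1, x} so the
    # residual cannot shift the fitted slope or intercept.
    e = rng.standard_normal(x.size)
    design = np.column_stack([np.ones_like(x), x])
    e -= design @ np.linalg.lstsq(design, e, rcond=None)[0]

    # r^2 = SSR / (SSR + SSE) with SSR = slope^2 * Sxx,
    # so the residual sum of squares must equal SSR * (1 - r^2) / r^2.
    sxx = np.sum((x - x.mean()) ** 2)
    sse = slope**2 * sxx * (1.0 - r**2) / r**2
    e *= np.sqrt(sse / np.sum(e**2))

    return intercept + slope * x + e


# x values shared by the first three Anscombe datasets.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)

# Two different point clouds with the same regression line and correlation.
y_a = degenerate_dataset(x, slope=0.5, intercept=3.0, r=0.816, seed=1)
y_b = degenerate_dataset(x, slope=0.5, intercept=3.0, r=0.816, seed=2)

for y in (y_a, y_b):
    fit_slope, fit_intercept = np.polyfit(x, y, 1)
    print(fit_slope, fit_intercept, np.corrcoef(x, y)[0, 1], y.mean())
```

Because the residual is orthogonal to both the constant and x, the mean of y and the fitted coefficients are fixed by construction, while the individual points differ from seed to seed. In this sketch the sign of the correlation follows the sign of the slope.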

Taxonomy

Topics: Advanced Statistical Methods and Models · Statistical Mechanics and Entropy · Gaussian Processes and Bayesian Inference