This is not normal! (Re-) Evaluating the lower $n$ guidelines for regression analysis
David Randahl

TL;DR
This study challenges the traditional rule of thumb that a sample size of at least 30 is needed for valid regression inferences, showing that distributional characteristics such as skewness and kurtosis strongly influence the sample size required for t-values to converge to the t-distribution.
Contribution
The paper provides new, simulation-based guidelines indicating that smaller sample sizes may suffice under certain distributional conditions, revising the conventional $n \geq 30$ rule.
Findings
Symmetric or platykurtic variables allow smaller sample sizes for convergence.
Highly skewed variables require larger sample sizes for reliable t-values.
The traditional $n \geq 30$ rule is either overly conservative or insufficient, depending on the distributions of the variables.
Abstract
The commonly cited rule of thumb for regression analysis, which suggests that a sample size of $n \geq 30$ is sufficient to ensure valid inferences, is frequently referenced but rarely scrutinized. This research note evaluates the lower bound for the number of observations required for regression analysis by exploring how different distributional characteristics, such as skewness and kurtosis, influence the convergence of t-values to the t-distribution in linear regression models. Through an extensive simulation study involving over 22 billion regression models, this paper examines a range of symmetric, platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000. The results show that it is sufficient that either the dependent or independent variable follow a symmetric distribution for the t-values to converge at much smaller sample sizes than $n = 30$, unless the other…
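The kind of simulation the abstract describes can be illustrated with a minimal sketch. This is not the author's code or design: the distributions (standard normal and a log-normal), the sample size, the replication count, and the use of the empirical 5% rejection rate as a convergence check are all assumptions made for illustration, and the function names `slope_t_value` and `rejection_rate` are hypothetical. The idea is to draw the dependent and independent variables independently, so that the true slope is zero, and see how often the slope's t-value exceeds the two-sided critical value of the t-distribution with $n - 2$ degrees of freedom. If the t-values follow the t-distribution, that rate should be close to 0.05.

```python
import math
import random

# Two-sided 5% critical values of Student's t for df = n - 2
# (standard table values; only the sample sizes used below are listed).
T_CRIT_975 = {8: 2.306, 28: 2.048}


def slope_t_value(x, y):
    """t-value of the slope in a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return beta / std_err


def rejection_rate(draw_x, draw_y, n, reps, rng):
    """Share of replications in which |t| exceeds the 5% critical value.

    draw_x and draw_y are callables taking a random.Random instance;
    x and y are drawn independently, so the true slope is zero.
    """
    crit = T_CRIT_975[n - 2]
    hits = 0
    for _ in range(reps):
        x = [draw_x(rng) for _ in range(n)]
        y = [draw_y(rng) for _ in range(n)]
        if abs(slope_t_value(x, y)) > crit:
            hits += 1
    return hits / reps


def normal(rng):
    return rng.gauss(0.0, 1.0)          # symmetric


def lognormal(rng):
    return rng.lognormvariate(0.0, 1.0)  # right-skewed (assumed example)


if __name__ == "__main__":
    rng = random.Random(2024)
    cases = [("normal", normal), ("lognormal", lognormal)]
    for name_x, draw_x in cases:
        for name_y, draw_y in cases:
            rate = rejection_rate(draw_x, draw_y, n=10, reps=5000, rng=rng)
            print(f"x={name_x:9s} y={name_y:9s} n=10 rejection rate={rate:.3f}")
```

Comparing the printed rates against the nominal 0.05 for different combinations of symmetric and skewed variables, and repeating with `n=30`, gives a rough feel for the question the paper studies at far larger scale. The paper's own convergence criterion may differ from this rejection-rate check.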
Taxonomy
Topics: Statistical Methods and Inference
