Best Practices for Developing Linear Models With Multiple Explanatory Variables
Baidu Li, Xinhai Li

TL;DR
This paper outlines best practices for building linear models with multiple explanatory variables, emphasizing techniques like interaction terms, variable screening, and regularization.
Contribution
The paper introduces a systematic approach for model selection, highlighting the use of interaction terms and shrinkage methods in moderate-sized datasets.
Findings
Including two-way interactions and quadratic terms improves model accuracy.
Random forest screening followed by stepwise regression enhances variable selection.
Shrinkage methods like lasso and ridge regression improve model fitting in high-dimensional data.
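To make the idea of shrinkage concrete, here is a minimal sketch, not the paper's procedure. It assumes the simplest possible case: one predictor and no intercept. The function names `ridge_slope` and `lasso_slope` and the penalty scaling (`lam * b^2` and `lam * |b|` added to the residual sum of squares) are illustrative choices, not taken from the paper. In this case both estimators have closed forms, which shows how ridge pulls the coefficient toward zero while lasso can set it exactly to zero.

```python
def _sums(x, y):
    """Return (sum of x^2, sum of x*y) for a no-intercept fit."""
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    return sxx, sxy


def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 (assumed penalty scaling)."""
    sxx, sxy = _sums(x, y)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * |b| via soft-thresholding."""
    sxx, sxy = _sums(x, y)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # shrink |sxy| toward zero
    return (shrunk if sxy >= 0 else -shrunk) / sxx


# Toy data (made up for illustration): y is exactly 2 * x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

print(ridge_slope(x, y, 0.0))    # 2.0  (no penalty: ordinary least squares)
print(ridge_slope(x, y, 14.0))   # 1.0  (shrunk, but never exactly zero)
print(lasso_slope(x, y, 14.0))   # 1.5  (shrunk)
print(lasso_slope(x, y, 56.0))   # 0.0  (coefficient removed entirely)
```

With many predictors there is no such closed form for lasso, and a dedicated library would normally be used, but the qualitative behavior is the same: ridge shrinks all coefficients, while lasso can drop some of them from the model.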
Abstract
Linear models, including t‐test, ANOVA, regression, ANCOVA, and generalized linear models, are foundational tools in statistical analysis. For large datasets, such as those involving tens of thousands of genes and millions of records, numerous advanced methods have been developed to improve both computational efficiency and reliability. Here, we focus on a more general scenario: a linear model with many explanatory variables (e.g., >10) and a moderate sample size (e.g., thousands of observations). This paper provides the best practices for model selection, emphasizing the importance of including two‐way interaction and quadratic terms, which are frequently overlooked in textbooks and classic literature. When dealing with high‐dimensional data, we recommend using random forest for initial variable screening, followed by subset selection methods such as stepwise regression. Model…
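As an illustration of what adding two-way interaction and quadratic terms means in practice, the following sketch expands one observation into main effects, squared terms, and pairwise products. The helper names `expand_terms` and `n_terms` and the term labels (`x1^2`, `x1:x2`) are hypothetical and chosen only for this example; the paper's own workflow may differ.

```python
from itertools import combinations


def expand_terms(row, names):
    """Map term labels to values: main effects, squares, and two-way products."""
    terms = dict(zip(names, row))
    # Quadratic terms: x_i^2
    for name, value in zip(names, row):
        terms[f"{name}^2"] = value * value
    # Two-way interactions: x_i * x_j for every pair i < j
    for (n1, v1), (n2, v2) in combinations(zip(names, row), 2):
        terms[f"{n1}:{n2}"] = v1 * v2
    return terms


def n_terms(p):
    """Number of candidate terms for p predictors: p main + p squared + p*(p-1)/2 pairs."""
    return 2 * p + p * (p - 1) // 2


# One made-up observation with three predictors.
expanded = expand_terms([2.0, 3.0, 5.0], ["x1", "x2", "x3"])
print(list(expanded))
# ['x1', 'x2', 'x3', 'x1^2', 'x2^2', 'x3^2', 'x1:x2', 'x1:x3', 'x2:x3']
print(expanded["x1:x3"])  # 10.0
print(n_terms(10))        # 65
```

The count grows quadratically: 10 predictors already yield 65 candidate terms, which is one reason a variable screening step before subset selection is useful.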
Taxonomy
Topics: Data Analysis with R · Advanced Statistical Modeling Techniques · Advanced Statistical Methods and Models
