# Repeated Sieving for Prediction Model Building with High-Dimensional Data

**Authors:** Lu Liu, Sin-Ho Jung

PMC · DOI: 10.3390/jpm14070769 · Journal of Personalized Medicine · 2024-07-19

## TL;DR

This paper introduces repeated sieving, a variable selection method that builds prediction models from high-dimensional data using fewer features, chosen for their statistical significance.

## Contribution

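The novel contribution is the repeated sieving method, which extends standard regression with stepwise variable selection and selects features by statistical significance. This addresses the over-selection seen with existing ML methods such as LASSO and Elastic Net.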

## Key findings

- Repeated sieving selects far fewer features than LASSO and Elastic Net.
- The method achieves higher prediction accuracy than these existing ML methods in the numerical studies and real data examples.
- Because fewer features enter the fitted model, less data has to be collected when the model is used in future investigations, which lowers cost.

## Abstract

Background: The prediction of patients’ outcomes is a key component in personalized medicine. Oftentimes, a prediction model is developed using a large number of candidate predictors, called high-dimensional data, including genomic data, lab tests, electronic health records, etc. Variable selection, also called dimension reduction, is a critical step in developing a prediction model using high-dimensional data.

Methods: In this paper, we compare the variable selection and prediction performance of popular machine learning (ML) methods with our proposed method. LASSO is a popular ML method that selects variables by imposing an L1-norm penalty on the likelihood. With this approach, LASSO selects features based on the size of regression estimates, rather than their statistical significance. As a result, LASSO can miss significant features while it is known to over-select features. Elastic net (EN), another popular ML method, tends to select even more features than LASSO since it uses a combination of L1- and L2-norm penalties that is less strict than an L1-norm penalty. Insignificant features included in a fitted prediction model act like white noise, so the fitted model loses prediction accuracy. Furthermore, for the future use of a fitted prediction model, we have to collect the data of all the features included in the model, which is costly and may lower the accuracy of the data if the number of features is too large. Therefore, we propose an ML method, called repeated sieving, extending the standard regression methods with stepwise variable selection. By selecting features based on their statistical significance, it resolves the over-selection issue with high-dimensional data.

Results: Through extensive numerical studies and real data examples, our results show that the repeated sieving method selects far fewer features than LASSO and EN, but has higher prediction accuracy than the existing ML methods.

Conclusions: We conclude that our repeated sieving method performs well in both variable selection and prediction, and it saves the cost of future investigation on the selected factors.
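
The abstract's remark that EN's combined penalty is "less strict" than a pure L1 penalty can be illustrated with the standard coordinate-wise update used for penalized regression on standardized features. The sketch below is not the authors' repeated sieving method and uses none of their data. The input values `z_values`, the penalty strength `lam`, and the mixing weight `alpha` are hypothetical, chosen only to show how the thresholds differ at a fixed penalty strength.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def penalized_update(z, lam, alpha):
    """One coordinate update for a standardized feature.

    alpha = 1.0 is the LASSO (pure L1 penalty);
    0 < alpha < 1 is elastic net (mix of L1 and L2 penalties).
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Hypothetical per-feature inputs (assumption, not from the paper).
z_values = [0.90, 0.45, 0.30, 0.22, 0.12, -0.35, -0.08]
lam = 0.4  # hypothetical penalty strength

lasso = [penalized_update(z, lam, 1.0) for z in z_values]
enet = [penalized_update(z, lam, 0.5) for z in z_values]

# A feature is "selected" when its coefficient is nonzero.
n_lasso = sum(b != 0.0 for b in lasso)
n_enet = sum(b != 0.0 for b in enet)
print(n_lasso, n_enet)  # prints: 2 5
```

At the same `lam`, the elastic net threshold is `lam * alpha`, which is smaller than the LASSO threshold `lam`, so more coefficients stay nonzero. In both cases the cut is made on the magnitude of `z`, not on a significance test. In practice the penalty strength is tuned separately for each method (for example by cross-validation), so actual feature counts depend on the data.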

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11277592/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC11277592/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC11277592/full.md

---
Source: https://tomesphere.com/paper/PMC11277592