# Gradient boosting with knockoff filters: a biostatistical approach to variable selection

**Authors:** Amr Mohamed, Kevin H. Lee

PMC · DOI: 10.1186/s12859-025-06215-z · BMC Bioinformatics · 2025-11-25

## TL;DR

This paper introduces a variable selection method that combines knockoff filters with LightGBM, using SHAP values for interpretation, to identify important variables in large datasets while controlling the false discovery rate.

## Contribution

The paper extends the knockoff filter to LightGBM for variable selection, with SHAP values used to interpret the fitted model.

## Key findings

- The proposed method outperforms traditional approaches, accurately identifying the important variables for each class.
- It improves speed and efficiency across multiple datasets.
- Simulation studies validate the enhanced performance and interpretability.

## Abstract

As data complexity and volume increase rapidly, efficient statistical methods for identifying significant variables become crucial. Variable selection plays a vital role in establishing relationships between predictors and response variables. The challenge lies in achieving this goal while controlling the False Discovery Rate (FDR) and maintaining statistical power. The knockoff filter, a recent approach, generates inexpensive knockoff variables that mimic the correlation structure of the original variables, serving as negative controls for inference. In this study, we extend the use of knockoffs to Light Gradient Boosting Machine (LightGBM), a fast and accurate machine learning technique. SHapley Additive exPlanations (SHAP) values are employed to interpret the black-box nature of machine learning. Through extensive experimentation, our proposed method outperforms traditional approaches, accurately identifying important variables for each class. It offers improved speed and efficiency across multiple datasets. To validate our approach, an extensive simulation study is conducted. The integration of knockoffs into LightGBM enhances performance and interpretability, contributing to the advancement of variable selection methods. Our research addresses the challenges of variable selection in the era of big data, providing a valuable tool for identifying relevant variables in statistical modeling and machine learning applications.
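The selection step of a knockoff filter can be illustrated with a minimal sketch. This summary does not give the paper's exact feature statistic, so the code below makes assumptions: each original variable and its knockoff copy already has a non-negative importance score (for example, a mean absolute SHAP value from a LightGBM model fitted on the originals and knockoffs together), the statistic is the difference of the two scores, and the threshold is the standard knockoff+ rule. The function names `knockoff_threshold` and `knockoff_select` and the demo numbers are hypothetical, not taken from the paper.

```python
import numpy as np


def knockoff_threshold(w, q=0.1, offset=1):
    """Smallest t whose estimated false discovery proportion is <= q.

    offset=1 corresponds to the knockoff+ rule. Returns np.inf when no
    candidate threshold satisfies the bound, so nothing is selected.
    """
    w = np.asarray(w, dtype=float)
    # Candidate thresholds are the distinct non-zero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(w[w != 0])))
    for t in candidates:
        # Knockoffs that beat their originals estimate the false discoveries.
        fdp_hat = (offset + np.sum(w <= -t)) / max(1, np.sum(w >= t))
        if fdp_hat <= q:
            return float(t)
    return np.inf


def knockoff_select(importance_orig, importance_knock, q=0.1):
    """Select variables whose importance exceeds their knockoff's by >= t."""
    # Assumed statistic: W_j = Z_j - Z~_j (original minus knockoff importance).
    w = np.asarray(importance_orig, dtype=float) - np.asarray(
        importance_knock, dtype=float
    )
    t = knockoff_threshold(w, q=q)
    return np.flatnonzero(w >= t), w, t


# Hypothetical importance scores: six strong variables, four noise variables.
demo_orig = [5.0, 4.0, 3.0, 2.5, 2.0, 1.5, 0.25, 0.5, 0.25, 0.5]
demo_knock = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.25, 0.5, 0.25]
demo_selected, demo_w, demo_t = knockoff_select(demo_orig, demo_knock, q=0.2)
print("threshold:", demo_t, "selected indices:", demo_selected.tolist())
```

A large positive difference suggests the original variable carries signal that its knockoff cannot imitate, while noise variables are about as likely to fall on either side of zero. That symmetry is why the count of large negative differences serves as an estimate of the false discoveries among the large positive ones.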

## Full-text entities

- **Genes:** SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}
- **Diseases:** prostate cancer (MESH:D011471)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12801829/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12801829/full.md

## References

4 references — full list in the complete paper: https://tomesphere.com/paper/PMC12801829/full.md

---
Source: https://tomesphere.com/paper/PMC12801829