# Automated sparse feature selection in high-dimensional proteomics data via 1-bit compressed sensing and K-Medoids clustering

**Authors:** FuDong Wen, Yue Su, Dan Liu, YuPeng Wang, MeiNa Liu

PMC · DOI: 10.1186/s12859-025-06193-2 · BMC Bioinformatics · 2025-07-01

## TL;DR

ST-CS (Soft-Thresholded Compressed Sensing) automates sparse feature selection for proteomic biomarker discovery, selecting fewer features than competing methods while matching or exceeding their classification AUC.

## Contribution

ST-CS is a hybrid framework that combines 1-bit compressed sensing with K-Medoids clustering to automate sparse feature selection in proteomics, replacing manual thresholds with a data-driven partition of coefficient magnitudes.

## Key findings

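- In simulations, ST-CS selected features with sensitivity above 80% and specificity above 99.8%, and reduced false discovery rates by 20–50% relative to HT-CS.
- On CPTAC data, it achieved higher AUC than HT-CS, LASSO, and SPLSDA for glioblastoma (72.71%) and ovarian serous cystadenocarcinoma (75.86%), and matched HT-CS for intrahepatic cholangiocarcinoma (97.47%).
- It did so with fewer features than HT-CS: 37 vs. 86 for intrahepatic cholangiocarcinoma and 30 vs. 58 for glioblastoma.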

## Abstract

High-dimensional proteomics data present significant challenges in biomarker discovery due to technical noise, feature redundancy, and multicollinearity. Current feature selection methods, including filter, wrapper, and embedded approaches, struggle with stability, sparsity, and computational efficiency. To address these limitations, we propose Soft-Thresholded Compressed Sensing (ST-CS), a hybrid framework integrating 1-bit compressed sensing with K-Medoids clustering. Unlike conventional methods relying on manual thresholds, ST-CS automates feature selection by dynamically partitioning coefficient magnitudes into discriminative biomarkers and noise.
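The idea of partitioning coefficient magnitudes into two groups, rather than picking a manual cutoff, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `two_medoid_split`, the min/max initialisation, and the toy coefficient vector are all assumptions made for illustration, and the 1-bit compressed sensing step that would produce the coefficients is not shown.

```python
import numpy as np

def two_medoid_split(coefs, max_iter=100):
    """Return indices of coefficients in the high-magnitude cluster (1-D K-Medoids, k=2).

    Hypothetical helper for illustration only; not the ST-CS reference code.
    """
    mags = np.abs(np.asarray(coefs, dtype=float))
    # Assumed initialisation: start the medoids at the smallest and largest magnitudes.
    medoids = np.array([mags.min(), mags.max()])
    for _ in range(max_iter):
        # Assign each magnitude to its nearest medoid (column 0 = low, column 1 = high).
        labels = np.abs(mags[:, None] - medoids[None, :]).argmin(axis=1)
        new_medoids = medoids.copy()
        for k in (0, 1):
            members = mags[labels == k]
            if members.size == 0:
                continue  # keep the previous medoid for an empty cluster
            # A medoid is the cluster member with the smallest total distance to the others.
            cost = np.abs(members[:, None] - members[None, :]).sum(axis=1)
            new_medoids[k] = members[cost.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # assignments are stable
        medoids = new_medoids
    # Features whose magnitude falls in the higher-medoid cluster are kept.
    return np.flatnonzero(labels == medoids.argmax())

# Toy coefficients: three large-magnitude entries among near-zero ones.
toy_coefs = [0.01, -0.02, 0.9, 0.015, -1.1, 0.0, 0.85, 0.03]
selected = two_medoid_split(toy_coefs)
print(selected)
```

Because the split is driven by the data, the number of retained features follows from the coefficient distribution instead of a user-chosen threshold or a fixed top-k.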

Evaluations on simulated and real-world proteomic datasets demonstrated ST-CS’s superiority in feature selection capability and classification performance. In simulations, ST-CS achieved feature selection robustness with balanced sensitivity (> 80%) and specificity (> 99.8%), reducing false discovery rates (FDR) by 20–50% compared to Hard-Thresholded Compressed Sensing (HT-CS). Additionally, it attained superior F1 scores and Matthews Correlation Coefficients (MCC), outperforming HT-CS, LASSO, and SPLSDA in identifying true biomarkers while suppressing noise. For classification performance, ST-CS surpassed all methods in the area under the receiver operating characteristic curve (AUC) across varying noise levels while maintaining sparsity.

Applied to Clinical Proteomic Tumor Analysis Consortium (CPTAC) datasets, ST-CS matched HT-CS’s classification accuracy (AUC = 97.47% for intrahepatic cholangiocarcinoma) but with 57% fewer selected features (37 vs. 86), demonstrating its dual strength in precision biomarker discovery and predictive accuracy. For glioblastoma data, ST-CS achieved higher AUC (72.71%) than HT-CS (72.15%), LASSO (67.80%), and SPLSDA (71.38%) while retaining a parsimonious feature set (30 vs. 58 features for HT-CS). In ovarian serous cystadenocarcinoma, ST-CS further demonstrated its adaptability, attaining superior AUC (75.86%) over HT-CS (75.61%), LASSO (61.00%), and SPLSDA (70.75%) with only 24 ± 5 selected biomarkers.

These results highlight ST-CS’s ability to rigorously automate feature selection while balancing classification efficacy, interpretability, and scalability for translational proteomics.
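For readers less familiar with the feature-selection metrics named in the simulation results (sensitivity, specificity, FDR, F1, MCC), the sketch below shows how they are conventionally computed from a set of true biomarkers and a set of selected features. The function name `selection_metrics` and the toy numbers are hypothetical; they are not taken from the paper's simulations.

```python
import math

def selection_metrics(true_idx, selected_idx, n_features):
    """Standard confusion-matrix metrics for a feature-selection result.

    Hypothetical helper for illustration; inputs are feature indices.
    """
    true_set, sel_set = set(true_idx), set(selected_idx)
    tp = len(true_set & sel_set)      # true biomarkers that were selected
    fp = len(sel_set - true_set)      # noise features that were selected
    fn = len(true_set - sel_set)      # true biomarkers that were missed
    tn = n_features - tp - fp - fn    # noise features correctly left out

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero (a convention chosen for this sketch).
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "fdr": ratio(fp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# Toy case: 10 true biomarkers among 1000 features; 8 found, 2 false positives.
metrics = selection_metrics(range(10), list(range(8)) + [50, 51], 1000)
print(metrics)
```

With thousands of noise features and only a handful of true biomarkers, specificity stays close to 1 even when several false positives are selected, which is why FDR and MCC are informative complements in this setting.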

The online version contains supplementary material available at 10.1186/s12859-025-06193-2.

## Linked entities

- **Diseases:** intrahepatic cholangiocarcinoma (MONDO:0003210), glioblastoma (MONDO:0018177), ovarian serous cystadenocarcinoma (MONDO:0006046)

## Full-text entities

- **Diseases:** ovarian serous cystadenocarcinoma (MESH:D010049), Tumor (MESH:D009369), glioblastoma (MESH:D005909)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12220089/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12220089/full.md

---
Source: https://tomesphere.com/paper/PMC12220089