Feature Dimensionality Outweighs Model Complexity in Breast Cancer Subtype Classification Using TCGA-BRCA Gene Expression Data

Meena Al Hasani

arXiv:2605.06562 · cs.LG · May 8, 2026

TL;DR

In breast cancer subtype classification on TCGA-BRCA gene expression data, the number of selected genes affects performance more than model complexity, and logistic regression, the simplest of the models compared, gives the most balanced results across subtypes.

Contribution

This study systematically compares the effects of model complexity and feature selection on classification performance in high-dimensional gene expression data.

Findings

01  Feature selection significantly influences classification accuracy.

02  Logistic regression offers stable performance across subtypes.

03  Model complexity has less impact than feature dimensionality.
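
To make "feature dimensionality" concrete, here is a minimal sketch of variance-based gene selection, assuming that "highly variable genes" means ranking genes by their variance across samples and keeping the top k. The function name `top_variable_genes`, the gene names, and the toy matrix are hypothetical and are not taken from the paper; the paper's exact selection procedure is not given on this page.

```python
from statistics import pvariance


def top_variable_genes(expr, gene_names, k):
    """Return the names of the k genes with the largest variance across samples.

    expr is a list of samples; each sample is a list with one expression
    value per gene, in the same order as gene_names.
    """
    if not 0 < k <= len(gene_names):
        raise ValueError("k must be between 1 and the number of genes")
    # Transpose so that each entry holds one gene's values across all samples.
    per_gene = list(zip(*expr))
    variances = [pvariance(values) for values in per_gene]
    # Rank gene indices by descending variance; ties keep their original order.
    ranked = sorted(range(len(gene_names)), key=lambda j: -variances[j])
    return [gene_names[j] for j in ranked[:k]]


# Toy matrix (made up): 4 samples x 3 hypothetical genes.
expr = [
    [1.0, 5.0, 2.0],
    [1.1, 9.0, 2.0],
    [0.9, 1.0, 2.0],
    [1.0, 5.0, 2.0],
]
genes = ["GENE_A", "GENE_B", "GENE_C"]
selected = top_variable_genes(expr, genes, k=2)
print(selected)
```

Varying `k` is what changes the feature dimensionality seen by each classifier. When selection like this is combined with cross-validation, computing the variances on the training folds only avoids leaking information from the held-out fold; whether the paper does so is not stated here.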

Abstract

Accurate classification of breast cancer subtypes from gene expression data is critical for diagnosis and treatment selection. However, such datasets are characterized by high dimensionality and limited sample size, posing challenges for machine learning models. In this study, we evaluate the impact of model complexity and feature selection on subtype classification performance using TCGA-BRCA gene expression data. Logistic regression, random forest, and support vector machine (SVM) models were trained using varying numbers of highly variable genes (50 to 20,518). Performance was evaluated using stratified 5-fold cross-validation and assessed with accuracy and macro F1 score. While all models achieved high accuracy, macro F1 analysis revealed substantial differences in subtype-level performance. Logistic regression demonstrated the most stable and balanced performance across subtypes,…
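
The gap between accuracy and macro F1 described in the abstract can be illustrated with a small sketch. The labels below are made up (they are not TCGA-BRCA data or results from the paper), and the helper names `accuracy` and `macro_f1` are hypothetical. The sketch assumes macro F1 is the unweighted mean of per-class F1 scores, which is its usual definition.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); score 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: 18 samples of a common subtype and 2 of a rare one.
y_true = ["common"] * 18 + ["rare"] * 2
# A degenerate classifier that always predicts the common subtype.
y_pred = ["common"] * 20

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

Because every class contributes equally to macro F1, a model that ignores a rare subtype is penalized even when its overall accuracy is high. In practice, scikit-learn's `f1_score` with `average="macro"` computes the same quantity, and `StratifiedKFold` provides the stratified splits mentioned in the abstract.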
