# Precision in prediction: tailoring machine learning models for breast cancer missense variants pathogenicity prediction

**Authors:** Rahaf M Ahmad, Noura AlDhaheri, Mohd Saberi Mohamad, Bassam R Ali

PMC · DOI: 10.1093/bib/bbaf611 · 2025-11-20

## TL;DR

This study improves pathogenicity prediction for breast cancer missense variants by training machine learning models on breast cancer gene-specific data, yielding more accurate and interpretable results than general genome-wide tools.

## Contribution

The study introduces a disease-specific machine learning approach for breast cancer missense variant prediction with integrated interpretability techniques.

## Key findings

- The Extra Trees model achieved 99.1% accuracy on an independent ClinGen dataset.
- Recursive feature elimination identified key genomic features for efficient prediction.
- Interpretability techniques enhanced transparency and highlighted key drivers of predictions.
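
The feature-selection and interpretability steps listed above can be sketched with scikit-learn. This is a minimal illustration and not the authors' pipeline: the data are synthetic (generated with `make_classification`), and the number of retained features, the tree count, and the split ratio are arbitrary assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a variant feature table (NOT the paper's dataset).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Recursive feature elimination: refit the estimator repeatedly, dropping the
# least important feature each round until 5 remain (5 is an assumption).
selector = RFE(
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    n_features_to_select=5,
    step=1,
)
selector.fit(X_train, y_train)
kept = selector.support_  # boolean mask over the 12 original columns

# Train the final Extra Trees model on the retained features only.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_train[:, kept], y_train)
accuracy = model.score(X_test[:, kept], y_test)

# Permutation feature importance: shuffle one column at a time on held-out
# data and record the resulting drop in accuracy.
perm = permutation_importance(
    model, X_test[:, kept], y_test, n_repeats=10, random_state=0
)

print(f"held-out accuracy: {accuracy:.3f}")
for rank, idx in enumerate(perm.importances_mean.argsort()[::-1], start=1):
    print(f"{rank}. retained feature {idx}: {perm.importances_mean[idx]:.4f}")
```

Permutation importance is computed on the held-out split here so that the scores reflect generalization and not memorized training structure. Which split the paper used is not stated in this summary.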

## Abstract

Accurate classification of genetic variants is critical for precision medicine, particularly for hereditary diseases such as breast cancer. However, widely used tools like MutPred and Combined Annotation Dependent Depletion (CADD) offer genome-wide pathogenicity predictions that often overlook disease-specific variant behavior, limiting their clinical utility. This study addresses that gap by training and benchmarking nine machine learning (ML) models, including ensemble and baseline classifiers, on a breast cancer gene-specific dataset rich in conservation scores, functional annotations, and allele frequency features. Among all models, the Extra Trees model achieved the highest performance, with an accuracy of 0.999 and a 95% confidence interval of (0.998–1.000). Recursive feature elimination identified the most informative genomic features, enhancing model efficiency. To ensure clinical transparency, we applied interpretability techniques including Local Interpretable Model-Agnostic Explanations and permutation feature importance, which highlighted the key drivers of each prediction. The calibration curve further confirmed the reliability of predicted probabilities, supporting their potential use in clinical decision-making. On an independent ClinGen dataset, Extra Trees achieved 99.1% accuracy and outperformed widely used predictors, confirming its robustness and clinical applicability. This is the first comprehensive benchmarking study to apply ML models specifically to breast cancer-related missense variants using disease-gene-specific training data and integrated interpretability. Our results show that disease-specific ML approaches outperform general predictors, offering improved reliability, transparency, and relevance to clinical genomics. By bridging the gap between broad genome-wide tools and tailored clinical prediction, this study lays the foundation for implementing ML-driven pathogenicity prediction in breast cancer diagnostics and precision medicine, with potential expansion to other disease contexts.
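
The abstract reports an accuracy with a 95% confidence interval and a calibration check. The sketch below shows one common way to compute each using only the Python standard library. It rests on assumptions: the abstract does not say which interval method the authors used, so the Wilson score interval here is a stand-in, and the counts and probabilities in the usage lines are hypothetical values, not the paper's data. The helper names `wilson_interval` and `reliability_bins` are illustrative.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def reliability_bins(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed positive fraction, count)
    tuple per non-empty bin. A well-calibrated model has the first two values
    close to each other in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, label in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, label))
    summary = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            observed = sum(label for _, label in members) / len(members)
            summary.append((mean_p, observed, len(members)))
    return summary


# Hypothetical counts: 999 correct calls out of 1000 variants.
low, high = wilson_interval(999, 1000)
print(f"accuracy 0.999, approx. 95% CI ({low:.4f}, {high:.4f})")

# Hypothetical predicted probabilities and true labels (1 = pathogenic).
demo_bins = reliability_bins([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1], n_bins=2)
for mean_p, observed, count in demo_bins:
    print(f"mean predicted {mean_p:.2f} vs observed {observed:.2f} (n={count})")
```

The Wilson interval is used in this sketch because it stays within [0, 1] even when accuracy is near 1, where the simpler normal-approximation interval can overshoot.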

## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Diseases:** breast cancer (MESH:D001943), hereditary diseases (MESH:D030342)

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12632190/full.md

---
Source: https://tomesphere.com/paper/PMC12632190