# MFF-AE: Enhanced Quality Control for Proteomics Mass Spectrometry Data via Multi-Scale Feature Fusion

**Authors:** Guangkui Fan, Xinyu Ji, Hunyue Liao, Bo Meng, Duotao Pan, Jinze Huang, Yang Zhao

PMC · DOI: 10.3390/ijms27052121 · 2026-02-25

## TL;DR

This paper introduces MFF-AE, a deep learning model that improves quality control in proteomics mass spectrometry data by detecting anomalous samples more accurately than existing methods.

## Contribution

MFF-AE is an autoencoder-based deep learning model that fuses global and local multi-scale features to improve detection of anomalous samples in proteomics data.

## Key findings

- MFF-AE outperforms 15 baseline models (statistical learning, deep learning, and ensemble learning) in detecting anomalous samples on a benchmark dataset with synthesized anomalies.
- Excluding outliers identified by MFF-AE increases statistical significance and fold change in differential proteins in clinical datasets.
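
To make the second finding concrete, the following sketch computes a log2 fold change and a Welch t statistic for a single protein, with and without one outlying sample. The abundance values are invented toy numbers on a log2 scale, not data from the paper, and the helper names are hypothetical; the paper's own differential analysis may differ.

```python
import math
import statistics

def log2_fold_change(case, control):
    """Difference of group means; valid when abundances are already log2-transformed."""
    return statistics.mean(case) - statistics.mean(control)

def welch_t(case, control):
    """Welch's t statistic (unequal variances) for two independent groups."""
    se = math.sqrt(statistics.variance(case) / len(case)
                   + statistics.variance(control) / len(control))
    return (statistics.mean(case) - statistics.mean(control)) / se

# Toy log2 abundances for one protein (illustrative values only).
control = [10.0, 10.2, 9.8, 10.1, 9.9]
case = [11.0, 11.2, 10.8, 11.1, 7.0]  # last sample is a hypothetical outlier
case_clean = case[:-1]                 # same group with the outlier excluded

fc_all = log2_fold_change(case, control)
fc_clean = log2_fold_change(case_clean, control)
t_all = welch_t(case, control)
t_clean = welch_t(case_clean, control)

print(f"with outlier:    log2FC={fc_all:.3f}, t={t_all:.2f}")
print(f"without outlier: log2FC={fc_clean:.3f}, t={t_clean:.2f}")
```

In this toy case a single low-abundance sample both pulls the case mean toward the control mean and inflates the case variance, so removing it raises the fold change and the test statistic together.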

## Abstract

Mass spectrometry (MS) is a core analytical tool in proteomics, and the quality of the generated data directly determines the effectiveness of downstream analyses and the reliability of final research conclusions. While MS is also widely used in other omics applications, this study focuses on label-free quantitative proteomics, where samples are represented as protein-abundance matrices derived from MaxQuant. However, MS data are typically characterized by high dimensionality and substantial noise, posing serious challenges for quality control (QC). Existing QC methods have limited feature extraction capabilities and struggle to capture the key information embedded in the data, resulting in poor performance in identifying anomalous samples. Here, we propose the Multi-Scale Feature Fusion-based Autoencoder (MFF-AE), a deep learning-based anomaly detection model that achieves precise identification of anomalous samples by integrating both global and local data features. The model consists of three modules: an autoencoder-based backbone network that efficiently embeds raw data into a low-dimensional semantic space, a local feature extraction and fusion module designed to capture and integrate multi-scale features within MS data, and a sample identification module that enhances discriminative representations to enable accurate anomaly detection. To evaluate the effectiveness of the proposed model, we conduct extensive experiments on a benchmark dataset with synthesized anomalies. Quantitative results on this dataset show that, compared with 15 baseline models from statistical learning, deep learning, and ensemble learning, our model consistently achieves the best performance across key metrics. Furthermore, in linear relationship analysis on real-world clinical datasets, excluding outlier samples significantly increased the statistical significance and fold change of the identified differential proteins. Overall, the proposed model establishes a solid data foundation, paving the way for downstream mechanistic studies and target discovery.
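
The abstract does not give enough detail to reproduce the three modules, but the general idea of scoring samples by reconstruction error over fused multi-scale features can be illustrated with a much simpler stand-in. The sketch below is not the authors' implementation. It swaps the deep autoencoder for a rank-k linear autoencoder (PCA via SVD), approximates multi-scale features by average pooling at a few window sizes, and uses a synthetic low-rank matrix with injected noise as a toy protein-abundance matrix. All function names, scales, and sizes are assumptions chosen for illustration.

```python
import numpy as np

def multi_scale_features(X, scales=(1, 4, 16)):
    """Average-pool each sample's features at several window sizes and concatenate."""
    feats = []
    for s in scales:
        n = (X.shape[1] // s) * s  # drop the ragged tail so the reshape is exact
        pooled = X[:, :n].reshape(X.shape[0], n // s, s).mean(axis=2)
        feats.append(pooled)
    return np.concatenate(feats, axis=1)

def reconstruction_scores(F, k):
    """Per-sample reconstruction error under a rank-k linear autoencoder (PCA)."""
    Fc = F - F.mean(axis=0)
    # The top-k right singular vectors act as a tied encoder/decoder pair.
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    W = Vt[:k]                      # (k, d): encode with W.T, decode with W
    recon = Fc @ W.T @ W
    return np.linalg.norm(Fc - recon, axis=1)

# Synthetic stand-in for a samples x proteins abundance matrix (assumed sizes).
rng = np.random.default_rng(0)
n_samples, n_proteins, rank = 100, 200, 3
Z = rng.normal(size=(n_samples, rank))
L = rng.normal(size=(rank, n_proteins))
X = Z @ L + 0.1 * rng.normal(size=(n_samples, n_proteins))

# Synthesized anomalies: a few samples receive much larger noise.
anomaly_idx = [3, 27, 41, 68, 90]
X[anomaly_idx] += 2.0 * rng.normal(size=(len(anomaly_idx), n_proteins))

scores = reconstruction_scores(multi_scale_features(X), k=rank)
flagged = sorted(np.argsort(scores)[-len(anomaly_idx):].tolist())
print(flagged)
```

Because the linear model here is fit on all samples, including the anomalous ones, the rank is set to the toy data's latent rank so the components do not absorb the injected noise. A real pipeline would need to choose the model capacity and the score threshold with more care.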

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12984335/full.md

---
Source: https://tomesphere.com/paper/PMC12984335