mspms: an R package and GUI for multiplex substrate profiling by mass spectrometry
Charlie Bayne, Brianna Hurysz, David J. Gonzalez, Anthony O’Donoghue

TL;DR
The mspms package provides an accessible tool for analyzing complex protease substrate data using mass spectrometry, making it easier for researchers to study enzyme specificity.
Contribution
Mspms is the first comprehensive, user-friendly platform for analyzing multiplex substrate profiling data by mass spectrometry.
Findings
Mspms reliably captures expected substrate specificities when validated with data from four cathepsins.
The tool integrates preprocessing, normalization, statistical testing, and visualization into a single framework.
Mspms is available as an R package and a web-based GUI, enhancing accessibility for diverse users.
Abstract
Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) is a powerful method for determining the substrate specificity of proteolytic enzymes, which is essential for developing protease inhibitors, diagnostics, and protease-activated therapeutics. However, the complex datasets generated by MSP-MS pose significant analytical challenges and have limited accessibility for non-specialist users. We developed mspms, a Bioconductor R package with an accompanying graphical interface, to streamline the analysis of MSP-MS data. Mspms standardizes workflows for data preparation, processing, statistical analysis, and visualization. The tool is designed for accessibility, serving advanced users through the R package and broader audiences through a web-based interface. We validated mspms using data from four well-characterized cathepsins (A–D), demonstrating that it reliably captures expected…
Funding
- National Institute of General Medical Sciences, United States (https://doi.org/10.13039/100000057)
- National Institute of Allergy and Infectious Diseases (https://doi.org/10.13039/100000060)
- National Cancer Institute (https://doi.org/10.13039/100000054)
Taxonomy
Topics: Advanced Proteomics Techniques and Applications · Mass Spectrometry Techniques and Applications · Advanced Biosensing Techniques and Applications
Introduction
Proteases play crucial roles in a wide range of biological processes, from digestion and immunity to cancer and neurodegenerative diseases [1]. Understanding the substrate specificity of these enzymes is essential for designing inhibitors, diagnostics, and protease-activated therapeutics [2]. One of the most effective methods for determining protease substrate specificity is Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) [3]. This technique involves incubating a rationally designed peptide library with a protease or protease-containing sample and using mass spectrometry to identify the resulting cleavage products, revealing the enzyme’s substrate preferences [4].
The data produced by MSP-MS are complex and multi-dimensional. Accurate interpretation of these results requires rigorous data analysis, encompassing multiple steps: preparing the data (identifying cleavage motifs and positions), processing the data (data transformation, normalization, and imputation), statistical analysis, and data visualization. MSP-MS data analysis has historically relied on MSP-Xtractor software (https://pharm.ucsf.edu/craik/research/extractor) and/or custom, sparsely documented scripts. MSP-Xtractor provides basic functionality for simplifying the output generated by ProteinProspector [5], but it lacks integration with downstream normalization, statistical testing, or visualization tools. Custom scripts extended the analysis to include imputation and basic statistical testing but depended heavily on manual steps, including exporting data to spreadsheets, editing hardcoded variables, and uploading processed results to the IceLogo graphical interface for visualization [6].
While functional, these labor-intensive steps were error-prone and limited scalability, accessibility, and reproducibility. Inconsistent and irreproducible analysis pipelines have been noted to cause significant problems in biological research [7], often as a consequence of decentralized, error-prone code evolution when personnel transition [8]. The absence of standardized, reproducible data analysis tools for MSP-MS has hindered scientific progress by creating barriers to collaboration across research groups. Although a recently published standardized analysis protocol improved transparency, it remains non-automated and lacks integration with interactive visualization and statistical frameworks [3].
To address these challenges, we developed mspms, an R package specifically designed for the robust, reproducible analysis of MSP-MS data. Through integration into the Bioconductor ecosystem [9], the mspms package adheres to best practices in software development and data analysis, offering a transparent and portable solution for processing complex datasets. Recognizing that many users may not have programming experience, we complemented the R package with a user-friendly graphical interface, available both as a web application and for download. This interface allows researchers to perform key MSP-MS analysis steps—data preprocessing, normalization, statistical analysis, and visualization—without needing any R programming knowledge.
Here, we demonstrate the functionality of mspms by analyzing publicly available MSP-MS data for four well-characterized cathepsins, validating the package’s ability to accurately determine their substrate specificities. By offering comprehensive functionality, transparency, and user accessibility, mspms is positioned to be a valuable tool for the protease research community, streamlining the analysis of MSP-MS data while promoting reproducible research.
Materials and methods for study
Raw data from a previously reported MSP-MS study was acquired from the MassIVE Repository (accession number MSV00008595) [10]. In brief, these data were generated from a study that utilized a 228-member, rationally designed 14-mer peptide library covering diverse amino acid contexts that was incubated with either 18.4 nM cathepsin A, 2.64 nM cathepsin B, 19.6 nM cathepsin C or 100 nM cathepsin D. The concentration of each peptide in the library was 0.5 μM. After incubation at 37 °C for defined time points, the reaction was quenched by addition of 6.4 M guanidine hydrochloride. These samples were desalted with C18 spin columns, and ~ 0.4 μg of each sample was subjected to LC-MS/MS analysis using an Ultimate 3000 HPLC and Q-Exactive mass spectrometer. LC and MS parameters were as previously reported [10].
Upstream proteomic software
Peptides/proteins were identified and quantified using PEAKS Studio [11], Proteome Discoverer [12], FragPipe [13], and Sage [14]. The database used in each search was the 228-member peptide library described previously [3] (Supplementary File 1).
PEAKS Studio
Data from all .raw files were processed using PEAKS Studio v8.5 software with a customized template (Supplementary File 2). For each sample, experiment-specific parameters were set as follows: Q-Exactive instrument, HCD fragmentation, no enzyme. Scans were merged with a retention time window of 0.8 min and a precursor m/z error tolerance of 10 ppm. Precursor mass was corrected. Scans were filtered to include retention times between 0 and 95 min with a precursor mass tolerance of 10 ppm. For identification, a precursor mass tolerance of 20 ppm using monoisotopic mass and a fragment ion tolerance of 0.01 Da were specified. No PTMs were included in the search. FDR was estimated using the decoy-fusion strategy. Label-free quantification was performed with a mass error tolerance of 9 ppm and a retention time shift tolerance of 3 min. Replicate samples were added to new groups.
The peaks_protein-peptides-lfq.csv file was prepared by navigating to the quantification options, setting the normalization factor to “No normalization”, and changing the peptide filters to include all peptides (quality ≥ 0, Avg. Area ≥ 0, Peptide ID Count ≥ 0, charge +1 to +10, and at least 1 confident sample). Protein filters were changed so that no filtering occurred (Significance ≥ 0). Data were then exported as the peptides-lfq.csv file.
Data in figures is derived from the PEAKS software, unless otherwise specified.
Proteome Discoverer
Data from all .raw files were also processed using Proteome Discoverer v2.5.0.400 with a customized processing and consensus workflow (Supplementary Files 3 and 4). Briefly, the minimum precursor mass was set to 350 Da and the maximum precursor mass to 5000 Da. The enzyme was set to unspecific, with a minimum peptide length of 5 and a maximum peptide length of 14. Precursor mass tolerance was set to 10 ppm, and fragment mass tolerance was set to 0.6. Percolator target FDR was set to 0.01.
FragPipe
FragPipe v22.0, MSFragger version 4.1, IonQuant version 1.10.27, and Python version 3.9.13 were used to process all .raw files with a customized analysis workflow derived from the MBR-LFQ workflow template (Supplementary File 5). Briefly, decoys were added to the peptide library database, cleavages were set to nonspecific, peptide length was set to 5–14 residues, peptide mass was set to 350–5000 Da, match between runs was enabled, and top runs was set to 3 (as there were 4 biological replicates in each group except for cathepsin D at time zero).
Sage
Sage v0.14.7 was used to analyze mzML files using config settings appropriate for MSP-MS analysis (Supplementary File 6). Briefly, cleavages were set to nonspecific through the cleave_at = “” parameter, peptide lengths were restricted to 5–14 amino acids, and peptide masses were constrained between 350 and 5000 Da.
Implementation
Mspms is implemented as an R package within the Bioconductor ecosystem, providing standardized workflows for preprocessing, processing, statistical analysis, and visualization of MSP-MS data.
Preprocessing
To prepare MSP-MS data for analysis, mspms preprocesses an exported file from the user’s proteome software of choice. In this process, the data is converted to a standardized format and loaded as a QFeatures object [15] containing a SummarizedExperiment [16] object named “peptides”, which contains the detected peptide intensities. The QFeatures class was selected to store quantitative information because it is specifically designed for high-throughput mass spectrometry data, enabling native support for common data transformations such as normalization and imputation that are required for MSP-MS analysis, while also providing a natural framework to incorporate additional levels of aggregation (e.g., PSM-to-peptide) if desired in future extensions of the workflow. Historically, MSP-MS experiments have used either spectral counts or MS1 peak intensities for quantification [3]; in mspms, MSP-MS data are quantified using label-free MS1-based intensities rather than spectral counts because ion-current measurements generally provide higher quantitative accuracy and greater sensitivity for detecting small abundance differences on modern high-resolution instruments [17]. Cleavage motifs of a user-specified length on the N and C terminal side of the scissile bond (the peptide bond specifically cleaved by the protease, located between the P1 and P1′ residues) are calculated, and the numerical position of each cleavage site is determined with reference to the peptide from which it originated. These peptide centric features are then loaded as the rowData corresponding to the QFeatures object. The colData composing the QFeatures experiment contains sample metadata describing the experiment and must include descriptors core to every MSP-MS experiment: “group”, “condition”, and “time”.
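The motif and position calculation described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the mspms R implementation; the function names, the example peptide, and the convention that a position counts the residues N-terminal to the scissile bond are assumptions made for illustration. Positions beyond a peptide terminus are padded with "X", following the convention the paper describes for its iceLogo plots.

```python
def cleavage_positions(library_peptide: str, product: str) -> list[int]:
    """Return the cleavage positions implied by a detected product.

    A position is the number of residues N-terminal to the scissile bond
    (assumed convention). Only the first match of the product is considered.
    """
    start = library_peptide.find(product)
    if start == -1:
        raise ValueError("product is not a fragment of the library peptide")
    end = start + len(product)
    positions = []
    if start > 0:  # a bond was cut on the N-terminal side of the product
        positions.append(start)
    if end < len(library_peptide):  # a bond was cut on the C-terminal side
        positions.append(end)
    return positions


def cleavage_motif(library_peptide: str, position: int, n_residues: int = 4) -> str:
    """Return n_residues on each side of the scissile bond, padded with 'X'."""
    padded = "X" * n_residues + library_peptide + "X" * n_residues
    center = position + n_residues  # index of the P1' residue in the padded string
    return padded[center - n_residues:center + n_residues]


# Hypothetical 14-mer; removing the C-terminal dipeptide leaves a 12-residue product.
peptide = "ACDEFGHIKLMNPQ"
site = cleavage_positions(peptide, "ACDEFGHIKLMN")[0]
motif = cleavage_motif(peptide, site)  # P4-P1 = "KLMN", P1'-P4' = "PQXX"
```

In this example the two trailing "X" characters at P3′ and P4′ are the signature that a dipeptide was removed from the C-terminus.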
Data processing
Peptide values are subjected to log₂ transformation followed by median-centered normalization using the center.median method. Due to the left-censored nature of MSP-MS data, imputation is subsequently performed using the QRILC method as implemented in the imputeLCMD R package [18]. Lastly, the data are reverse log₂ transformed for visualization purposes. All data manipulation is performed using MsCoreUtils [19]. Data resulting from each step of data processing are stored within the resulting QFeatures object as SummarizedExperiment objects named “log2_peptides”, “log2_peptides_norm”, “log2_peptides_norm_imputed”, and “peptides_norm”, respectively.
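As a rough illustration of the first two processing steps, the sketch below applies a log₂ transformation and then subtracts each sample's median. It is a Python sketch under stated assumptions, not the MsCoreUtils code: missing values are represented as None, non-positive intensities are treated as missing, and QRILC imputation is deliberately omitted.

```python
import math
from statistics import median


def log2_transform(sample):
    """log2-transform one sample; missing or non-positive values stay missing."""
    return [math.log2(v) if v is not None and v > 0 else None for v in sample]


def median_center(sample):
    """Subtract the sample median, computed over observed values only."""
    observed = [v for v in sample if v is not None]
    center = median(observed)
    return [v - center if v is not None else None for v in sample]


# One hypothetical sample with a missing peptide intensity.
raw = [8.0, 2.0, None, 32.0]
normalized = median_center(log2_transform(raw))
```

Centering every sample on its own median places samples on a common scale before imputation and testing, so differences in total injected material do not masquerade as cleavage.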
Quality control metrics
Mspms computes two quality control (QC) metrics and produces plots to assist users in evaluating MSP-MS data quality.
- The first metric reports the percentage of samples in which a given peptide from the library, either a full-length peptide or a cleavage product, is not detected. Full-length peptides are included because certain sequences may exhibit poor ionization and thus remain undetected, even if their corresponding cleavage products are observed.
- The second metric reports the percentage of the entire library that is undetected within each sample.
Together, these QC metrics help users diagnose issues related to data quality, ionization efficiency, or peptide preparation in their MSP-MS experiments.
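The two metrics can be sketched as follows. This is a Python illustration with a hypothetical data layout (a dictionary of samples, each mapping peptide sequences to an intensity or None), not the data structure or functions used by mspms.

```python
def percent_samples_undetected(intensities, library):
    """For each library peptide, the percentage of samples in which it is missing."""
    n_samples = len(intensities)
    return {
        peptide: 100.0
        * sum(1 for values in intensities.values() if values.get(peptide) is None)
        / n_samples
        for peptide in library
    }


def percent_library_undetected(intensities, library):
    """For each sample, the percentage of the library that is missing."""
    return {
        sample: 100.0
        * sum(1 for peptide in library if values.get(peptide) is None)
        / len(library)
        for sample, values in intensities.items()
    }


# Hypothetical toy data: four library entries, two samples.
library = ["AAA", "CCC", "DDD", "EEE"]
intensities = {
    "s1": {"AAA": 1.0, "CCC": None, "DDD": 2.0},
    "s2": {"AAA": 3.0, "CCC": 4.0, "DDD": 1.0, "EEE": None},
}
per_peptide = percent_samples_undetected(intensities, library)
per_sample = percent_library_undetected(intensities, library)
```

A peptide that is absent from a sample's dictionary is counted as undetected, the same as an explicit None.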
Statistics
Mspms supports two approaches:
- Linear modeling via limma [20], enabling flexible handling of complex designs (batch effects, repeated measures, covariates) and stabilized variance estimation through empirical Bayes methods.
- Pairwise t-tests (via rstatix) [21] with log₂ fold change relative to a user-defined denominator in colData.
All statistical tests were performed on log₂-transformed, imputed, and normalized peptide intensities.
In this study, results are reported using the limma-based approach, which provides comparable outcomes while improving speed, scalability, and statistical power relative to the pairwise t-test approach. Limma uses linear modeling with empirical Bayes moderation to stabilize variance estimates across peptides, increasing the reliability of statistical inference, particularly for peptides with few replicates. By default, significant peptides are defined as those with p.adj < 0.05 and log₂ fold change > 3 relative to time 0.
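The default significance rule can be sketched as a simple filter. The Python sketch below assumes that the adjusted p-values come from the Benjamini-Hochberg procedure, which the text does not state explicitly, and it takes raw p-values and log₂ fold changes as given rather than reproducing the limma model. The function names and input values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed adjustment method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(log2_fold_changes, p_values, p_cutoff=0.05, lfc_cutoff=3.0):
    """Flag peptides with p.adj < p_cutoff and log2 fold change > lfc_cutoff."""
    adjusted = benjamini_hochberg(p_values)
    return [
        p_adj < p_cutoff and lfc > lfc_cutoff
        for lfc, p_adj in zip(log2_fold_changes, adjusted)
    ]


# Hypothetical results for four peptides relative to time 0.
flags = significant([4.0, 5.0, 2.0, 6.0], [0.01, 0.04, 0.03, 0.5])
```

Only the first peptide passes both thresholds here: the second has a large fold change but an adjusted p-value just above 0.05, which shows why both criteria are applied together.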
Visualizations
Static plots are generated using ggplot2 [22]. Interactive heatmaps are produced via heatmaply [23], built on plotly [24].
iceLogo analysis
To visualize substrate amino acid preferences surrounding protease cleavage sites, we reimplemented the iceLogo algorithm in R [25]. This approach statistically compares amino acid frequencies at defined positions relative to the scissile bond between an experimental set and a reference set and visualizes the differences as letter heights in a logo.
In our implementation, residues in the cleavage sequence that extend beyond the peptide termini are represented as “X,” allowing the iceLogo plots to capture information on the positional specificity of cleavage events (Supplementary Fig. 1).
Definition of sets
- Experimental set: User-defined, representing the cleavage sequences of interest. In our analyses, this includes significantly altered peptides (p.adj < 0.05 and log₂ fold change > 3) relative to time 0.
- Reference set: Comprises all possible cleavage sequences present in the MSP-MS peptide library.
Frequency calculation
For each position within a user-defined amino acid window around the scissile bond, the count and frequency of each amino acid are computed separately for both the experimental and reference sets.
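A minimal Python sketch of the per-position frequency calculation follows. The function name and the example motifs are hypothetical; frequencies are returned as fractions between 0 and 1, and the same function would be applied once to the experimental set and once to the reference set.

```python
from collections import Counter


def position_frequencies(motifs):
    """Return, for each motif position, a dict of amino acid -> frequency."""
    width = len(motifs[0])
    frequencies = []
    for position in range(width):
        counts = Counter(motif[position] for motif in motifs)
        total = sum(counts.values())
        frequencies.append({aa: count / total for aa, count in counts.items()})
    return frequencies


# Four hypothetical 4-residue motifs; 'X' marks a position past the terminus.
experimental = position_frequencies(["AFLX", "AFKX", "GFLX", "ARLA"])
```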
Statistical testing
For each amino acid at each position, the standard deviation (σ) is calculated from the reference set frequency (f%):
$$\sigma = \sqrt{\frac{f\%}{N}}$$
where N represents the sample size, i.e., the number of peptides in the experimental set at that position.
The Z-score is then calculated as:
$$Z = \frac{X - \mu}{\sigma}$$
where X is the frequency of the amino acid at that position in the experimental set, and μ is the corresponding frequency in the reference set. This Z-score indicates how many standard deviations the observed frequency of the experimental set deviates from that of the reference set.
Significance is subsequently determined by whether the Z-score falls outside the confidence interval, with the Wichura algorithm [26] used to convert Z-scores into p-values. Only amino acids with p-values less than or equal to the user-specified threshold are retained for visualization.
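The test can be sketched numerically as follows. This Python illustration makes two assumptions that the text does not settle: frequencies are expressed as fractions (0 to 1) rather than percentages, which changes the scale of σ, and the p-value is taken as the two-sided tail of a standard normal distribution computed with the standard library rather than with the Wichura routine cited above. The reference frequency must be greater than zero.

```python
import math
from statistics import NormalDist


def z_score(experimental_freq, reference_freq, n_experimental):
    """Z = (X - mu) / sigma, with sigma = sqrt(reference_freq / N)."""
    sigma = math.sqrt(reference_freq / n_experimental)
    return (experimental_freq - reference_freq) / sigma


def two_sided_p_value(z):
    """Two-sided normal tail probability (assumed conversion)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: an amino acid at 4% in the reference set and
# 10% among 100 experimental cleavage sequences at the same position.
z = z_score(0.10, 0.04, 100)
p = two_sided_p_value(z)
retained = p <= 0.05  # kept for plotting at a user threshold of 0.05
```

With these inputs σ is 0.02 and Z is approximately 3, so the residue would be retained at a threshold of 0.05.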
Visualization
Significant differences between the experimental and reference frequencies are visualized as letter heights using the ggseqlogo R package [27]. Users may choose either percent change (PC) or fold change (FC) to define the height of each amino acid letter:
$$PC = F^{+} - F^{-}$$
$$FC = \frac{F^{+}}{F^{-}}$$
where F^+ and F^- are the frequencies in the experimental and reference sets, respectively.
Fold changes smaller than 1 are converted to maintain positive/negative directionality:
$$FC_{con} = \frac{1}{FC} \times (-1)$$
Handling extreme values
If only one amino acid is significant at a position and its calculated letter height is infinite, the height is set to the maximum allowable value in the logo.
If multiple amino acids at a given position are all significant and have infinite calculated heights, their combined height is capped at the maximal value displayable in the iceLogo plot.
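The letter-height formulas can be illustrated with a short Python sketch. The function names are hypothetical, and infinite values are returned as-is so that capping to the plot's maximum, as described above, is left to the plotting step.

```python
import math


def percent_change(f_exp, f_ref):
    """PC = F+ - F-."""
    return f_exp - f_ref


def fold_change(f_exp, f_ref):
    """FC = F+ / F-, with values below 1 converted to -1 / FC."""
    if f_ref == 0:
        return math.inf  # residue absent from the reference set
    fc = f_exp / f_ref
    if fc == 0:
        return -math.inf  # residue absent from the experimental set
    return fc if fc >= 1 else -1.0 / fc


enriched = fold_change(0.5, 0.25)    # twice as frequent as in the reference
depleted = fold_change(0.125, 0.25)  # half as frequent as in the reference
```

The conversion makes enrichment and depletion symmetric: a residue at twice the reference frequency plots at 2 and one at half the reference frequency plots at -2.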
Report generation
Mspms can generate a generic report (Supplementary File 7): a self-contained .html file with figures and embedded, downloadable data frames containing the normalized data and the statistical results. The report is produced by running the mspms R package inside a parameterized rmarkdown [28] template that incorporates the downloadthis [29] package.
Helper functions
To maintain an intuitive API, only a subset of functions are exported. Additional helpers are organized in helper_functions.R according to their scope.
Graphic interface
A graphical interface to the mspms R package is implemented with R Shiny [30]. It is hosted at https://gonzalezlab.shinyapps.io/mspms_shiny/ and can be downloaded from https://github.com/baynec2/mspms-shiny/.
Results
The mspms R package was developed to provide a dedicated tool to analyze MSP-MS data, focusing on reproducibility, ease of use, and robustness. It includes modular functions to handle key steps generalizable to any MSP-MS analysis: data preparation, processing, statistical analysis, and visualization (Fig. 1).
Fig. 1. Overview of the mspms R package. Schematic depicting the structure, key functions, and data flow of the mspms R package.
Data quality evaluation
To assess the quality of the MSP-MS data from the cathepsin A, B, C, and D experiments, we applied the quality control functions of mspms. We found that over 90% of the full-length peptide library was detected in all samples at time zero (T0), and more than 95% of the library, including cleavage products, was detected across the dataset (Supplementary Fig. 2A). Only five peptides from the library were consistently missing across all samples, suggesting high-quality data with minimal loss (Supplementary Fig. 2B).
Evaluation of global data patterns
Next, we examined global patterns in the dataset using principal component analysis (PCA) and unsupervised hierarchical clustering. PCA demonstrated tight clustering of replicates within each experimental group (condition and timepoint), as shown by the 95% confidence intervals surrounding each group (Fig. 2A). Near-perfect clustering of replicates from identical conditions was observed, indicating high experimental consistency. Differential peptide abundance between groups was evident, supporting distinct activity for each cathepsin over time (Fig. 2B).
Fig. 2. Global visualization of MSP-MS data. A Principal component analysis displaying PC1 and PC2. Samples are colored by time, while the shape and line type show the type of cathepsin, with ellipses representing the 95% confidence interval. B Heatmap showing the results of the experiment as clustered using unsupervised hierarchical clustering. Rows of the heatmap represent the samples, while columns represent the peptides. The color of the heatmap cells represents the normalized, column-centered, and scaled values. Colored bars to the right of the heatmap indicate the cathepsin and time of the samples in each row. Colored bars corresponding to each peptide in the columns display whether the corresponding peptide is a full-length peptide belonging to the 228-member peptide library (non-cleaved, dark blue) or a cleavage product (cleaved, blue).
Significant peptide changes and cleavage position preferences
For each cathepsin, peptide-level changes relative to T0 were quantified as log₂ fold changes and assessed for significance using limma-based statistical tests. The results, visualized as volcano plots, revealed between 22 and 153 significantly upregulated peptides (log₂ fold change ≥ 3, adjusted p ≤ 0.05) after the incubation times assessed (Fig. 3B). We then assessed the number of significantly different peptides as a function of time and found that this number increased progressively for each cathepsin, highlighting their dynamic substrate cleavage behavior (Supplementary Fig. 3).
Fig. 3. Differentially abundant peptide cleavages over time. A Summarized substrate specificities for cathepsins A–D as reported in the literature. B Volcano plots displaying, for each cathepsin, the log₂ fold change of each timepoint (indicated by color) relative to time 0 and the -log10 FDR-corrected p values. C Plot showing the number of significant cleavage events at each position of the peptide library (defined as having a log₂ fold change ≥ 3 and an FDR-adjusted p value ≤ 0.05). D iceLogo plots as implemented in the mspms package. Amino acid residues (with X representing positions past the terminus) four positions to the left and right of the cleavage site are displayed. For the iceLogo plots reported, significantly altered peptides (p ≤ 0.05, log₂FC ≥ 3) relative to time 0 were used as the experimental observations, and letter heights represent the percent change (PC) relative to the initial peptide library. Only residues with p ≤ 0.05 are shown.
To evaluate protease activity relative to reported substrate specificities (Fig. 3A), we investigated cleavage site preferences within the 14-mer peptides. Cathepsin A showed clear carboxypeptidase activity through the high number of cleavage sites at the C-terminus (Fig. 3C) and overrepresentation of X (corresponding to no amino acid) at P2′, P3′, and P4′ (Fig. 3D). Cathepsin B exhibited dipeptidyl carboxypeptidase activity, with 52 significant cleavages at peptide position 12 and 23 at position 10, consistent with sequential removal of dipeptides from the C-terminus (Fig. 3C). The enrichment of X at P3′ and P4′ evident in the iceLogo plot further supported this dipeptidyl carboxypeptidase activity (Fig. 3D).
Cathepsin C exhibited 33 significant cleavage events at position 2 and 6 at position 4, suggesting sequential dipeptide release from the N-terminus (Fig. 3C). An overrepresentation of X at P4 and P3 in the iceLogo also confirmed the dipeptidyl aminopeptidase activity (Fig. 3D). Lastly, cathepsin D demonstrated endopeptidase activity, as indicated by a peak of 16 significant cleavages centered at position 16 (Fig. 3C). The iceLogo plot showed that X was not enriched at any of the sites from P4 to P4′ further validating this endopeptidase activity (Fig. 3D).
Amino acid preferences
To visualize the amino acid preferences at cleavage sites, we performed an iceLogo analysis using mspms, focusing on the eight positions surrounding the cleavage site (P4 to P4′).
Cathepsin A showed a preference for the removal of hydrophobic amino acids (such as Phe and Leu) from the C-terminus of substrates when additional hydrophobic residues occupy the P1 position (Fig. 3D). Cathepsin B favored substrates with positively charged (Arg, Lys) or hydrophobic residues in P1 and P2 (Fig. 3D). Cathepsin C showed a preference for basic residues (His, Arg) at the P1 position and for Phe at the P1′ position (Fig. 3D). Cathepsin D showed a preference for Phe, Tyr, and norleucine at the P1 and P1′ positions (Fig. 3D).
Comparison of results across different upstream proteomics software
To demonstrate the compatibility of mspms with a range of upstream proteomics software, we analyzed MSP-MS data independently using PEAKS Studio, Proteome Discoverer (PD), FragPipe, and Sage. We then compared the mean log₂-transformed intensities of peptides detected by all platforms, the overlap of statistically significant peptides, and the cleavage profiles inferred from these peptides (Supplementary Fig. 4).
Processed data corresponding to peptides detected across the three closed source approaches correlated well, with R² values of 0.89 (FragPipe to PD), 0.86 (FragPipe to PEAKS), and 0.89 (FragPipe to Proteome Discoverer) (Supplementary Fig. 4A). However, comparing any of these to the results produced by the open source software Sage showed dramatically worse correlation, with R² values of 0.26 (Sage to PEAKS), 0.27 (Sage to PD), and 0.28 (Sage to FragPipe).
Among peptides identified as significantly different from time zero, only 13%, 5%, 10%, and 5% of those associated with cathepsins A, B, C, and D respectively, were shared across all four analytical platforms (Supplementary Fig. 4B). Notably, Sage and FragPipe identified unique peptides more frequently than the other software solutions.
Positional specificity, assessed via cleavage position plots, was highly consistent for cathepsins A, B, and C across all tools. However, Proteome Discoverer (PD) showed a slightly reduced ability to detect endopeptidase activity, as significant cleavages in the central region of peptides were less frequently observed. Sage, on the other hand, exhibited a markedly diminished ability to detect endopeptidase activity, with cleavage specificity plots showing many significant cleavages occurring at the N- and C-termini (Supplementary Fig. 4C).
The specificity profiles at positions P4–P4′, evaluated using iceLogos, were generally comparable across all tools, with subtle software-dependent differences. Major discrepancies were observed for cathepsin D, particularly in the iceLogos generated by Sage compared with the other tools, consistent with findings from the cleavage position plots (Supplementary Fig. 5).
Since cathepsin D was the only endopeptidase included in this study, we hypothesized that the observed discrepancies arose from differences in each tool’s ability to detect shorter peptides, which are more commonly generated by endopeptidases. Analyzing the distribution of significant peptides by length revealed that PD and Sage systematically detected fewer peptides shorter than eight amino acids compared to PEAKS and FragPipe (Supplementary Fig. 6). This likely explains the poorer performance of PD and Sage on cathepsin D compared to the other tools.
Discussion
Before the development of the mspms package, MSP-MS data analysis relied heavily on ad hoc R scripts that were fragmented, poorly documented, and difficult to adapt, raising concerns about reproducibility. Researchers without programming experience struggled to customize these workflows for different experimental designs, limiting the broader utility of MSP-MS. Furthermore, these scripts were specifically designed to process data exported from the proteomic search engine PEAKS, preventing their use across groups employing different platforms.
The mspms R package effectively addresses these limitations through a modular, reproducible, user-friendly design compatible with a wide range of proteomics software. It provides self-contained functions for data preparation, processing, statistical analysis, and visualization, ensuring ease of maintenance, extensibility, and usability.
Notably, mspms integrates the functionality of the widely cited iceLogo tool within R, allowing analysis of nonstandard amino acids, such as norleucine, and positions marked by “X.” A graphical user interface, accessible both online and via local download, enables researchers without R programming experience to leverage the core functionalities. Additionally, by employing established S4 classes internally, mspms integrates smoothly with the Bioconductor ecosystem, allowing advanced data exploration, statistical analyses, and visualizations. Together, these features make mspms a versatile tool that meets the diverse needs of the protease research community.
In this study, we applied mspms to profile the substrate specificity of four well-characterized cathepsin proteases: cathepsin A, B, C, and D. This application highlights features of mspms that are broadly applicable to any MSP-MS experiment, while providing rigorous benchmarking by evaluating its ability to detect substrate specificities previously established as ground truth.
A critical but often overlooked step in proteomic analysis is conducting a thorough quality control assessment to ensure data quality is sufficient for drawing biologically meaningful conclusions [31]. Since each MSP-MS experiment is based on a known peptide library, an effective quality control measure involves evaluating the percentage of the un-cleaved peptide library detected in each sample. Ideally, 100% of the un-cleaved peptide library should be detectable at T0; however, due to limitations in mass spectrometry performance, this is rarely achieved in practice. When we applied this quality control check to our cathepsin experiment, we observed no indication of data quality issues, confirming the reliability of our results.
Next, global data exploration allows evaluation of technical success and identification of meaningful patterns. mspms facilitates this with PCA and interactive heatmap plots. In our cathepsin MSP-MS experiment, PCA and heatmap analyses revealed tight clustering among replicates, indicating minimal variability, and clear separation of experimental groups by cathepsin type and time point, highlighting distinct substrate specificities. These visualizations reinforce both the reliability of the experiment and the capacity of mspms to support robust data exploration.
Once technical success is established, substrate specificity can be determined efficiently using mspms. The software first computes log₂ fold changes and FDR-corrected p-values using limma applied to normalized and imputed intensity values. Significant peptides are then visualized using cleavage location plots to illustrate positional specificity and iceLogo plots to reveal amino acid preferences.
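To make the two reported quantities concrete, the sketch below computes a log₂ fold change from raw intensities and adjusts a set of p-values for multiple testing. It deliberately does not reproduce limma's moderated statistics: the p-values are taken as given, and the Benjamini-Hochberg procedure is used here as an assumed FDR method. The function names and the example numbers are hypothetical.

```python
import math


def log2_fold_change(treated, control):
    """Difference between the mean log2 intensities of two groups."""
    mean_t = sum(math.log2(x) for x in treated) / len(treated)
    mean_c = sum(math.log2(x) for x in control) / len(control)
    return mean_t - mean_c


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical intensities for one peptide at a late time point versus T0.
lfc = log2_fold_change(treated=[400.0, 400.0], control=[100.0, 100.0])

# Hypothetical raw p-values for four peptides.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(lfc, adj)
```

In this simplified view, a peptide would be called significant when both its adjusted p-value and the magnitude of its log₂ fold change pass user-chosen thresholds.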
Application to the cathepsin dataset accurately recapitulated previously reported substrate specificities:
- Cathepsin A is a carboxypeptidase that preferentially removes hydrophobic amino acids (such as Phe and Leu) from the C-terminus of substrates, especially when additional hydrophobic residues occupy the P1 position, as previously reported [32].
- Cathepsin B is a dipeptidyl carboxypeptidase that cleaves dipeptides from the C-terminus, favoring substrates with positively charged (Arg, Lys) or hydrophobic residues in P1 and P2, as previously reported [33].
- Cathepsin C functions as a dipeptidyl aminopeptidase, cleaving dipeptides from the N-terminus with broad specificity, as previously reported [10].
- Cathepsin D is an endopeptidase that cleaves between hydrophobic amino acids, including Phe, Leu, and Tyr, as previously reported [34].
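The four activity types listed above differ mainly in where the scissile bond falls within a library peptide. The following toy heuristic, which is an illustration and not part of mspms, maps a terminal cleavage product back to its cleavage position and labels the position. It assumes a single cleavage per peptide and a hypothetical 14-residue sequence; a real endopeptidase cut near a terminus, or sequential cleavages, would be mislabeled by such a simple rule.

```python
def cleavage_site(peptide, product):
    """Number of residues N-terminal to the scissile bond.

    Assumes `product` is an N- or C-terminal fragment produced by one cleavage.
    If the product matches both ends, the N-terminal match is used.
    """
    if not product or product == peptide:
        raise ValueError("product must be a proper fragment of the peptide")
    if peptide.startswith(product):
        return len(product)
    if peptide.endswith(product):
        return len(peptide) - len(product)
    raise ValueError("product is not a terminal fragment of the peptide")


def label_activity(peptide, site):
    """Label a cleavage position by its distance from each terminus."""
    from_c_term = len(peptide) - site
    if from_c_term == 1:
        return "carboxypeptidase-like"
    if from_c_term == 2:
        return "dipeptidyl carboxypeptidase-like"
    if site == 2:
        return "dipeptidyl aminopeptidase-like"
    if site == 1:
        return "aminopeptidase-like"
    return "endopeptidase-like"


peptide = "AKLFGHWYRDNTSE"  # hypothetical library peptide
products = ["AKLFGHWYRDNTS", "AKLFGHWYRDNT", "LFGHWYRDNTSE", "AKLFGHW"]
labels = [label_activity(peptide, cleavage_site(peptide, p)) for p in products]
for product, label in zip(products, labels):
    print(product, "->", label)
```

Tallying such positions across all significant peptides gives the kind of positional summary that a cleavage location plot conveys.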
These results validate mspms’s ability to accurately identify expected substrate specificities. Beyond these enzymes, the package’s modular design allows it to analyze virtually any protease mixture, supporting diverse applications across the protease research community.
mspms is also designed to accommodate future advancements in MSP-MS assays. As peptide synthesis becomes more cost-effective and mass spectrometer technology advances, larger peptide libraries beyond the current 228-member set can be incorporated. Its modular architecture, reproducible workflows, and user-friendly features make mspms a versatile and enduring tool for protease research.
To enhance accessibility, mspms is compatible with four major proteome search engines: PEAKS Studio, Proteome Discoverer (PD), FragPipe, and Sage. The suitability of each platform for MSP-MS data analysis was validated by independently analyzing the cathepsin A–D datasets with each tool. Substrate specificity profiles for cathepsins A, B, and C were highly consistent across platforms. For cathepsin D, both PEAKS and FragPipe effectively detected endopeptidase activity, although FragPipe identified a higher frequency of significant N-terminal cleavages. Determining which profile is biologically most accurate would require orthogonal validation. Proteome Discoverer, in contrast, struggled to detect significant cleavages in the interior of peptides, though it produced an iceLogo motif plot largely consistent with the other tools. Sage performed poorly in detecting cathepsin D endopeptidase activity and generated a divergent iceLogo motif compared with the other platforms.
Based on these results, we recommend PEAKS Studio or FragPipe for experiments focused on endopeptidase activity. FragPipe is particularly appealing because it is freely available for academic use and demonstrates analysis speeds at least an order of magnitude faster than paid software solutions. Sage should be used with caution under the search settings applied here, unless optimized for smaller peptide detection. For studies where endopeptidase activity is not the primary focus, all four platforms provide consistent biological interpretations.
Given the substantial variability in search engine performance and configuration on MSP-MS data, standardized benchmarks are essential for assessing suitability. We therefore recommend the cathepsin dataset as a reference for implementing or validating new parser functions or search engine configuration settings for use with mspms.
Despite its advantages, mspms has limitations. First, it currently operates on result files from specific proteomic search engines. While we have incorporated parsers for several commonly used platforms, compatibility is not universal. However, because the input structure is standardized, new parser functions can be readily implemented to support additional search engines in future releases, leveraging the core functionality of the mspms platform.
Second, mspms currently supports the Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) workflow but does not yet incorporate the quantitative TMT-based extension (qMSP-MS), which has been shown to minimize experimental and instrument-derived variance while improving assay throughput [35].
Finally, mspms is currently limited to DDA-based acquisition strategies. Extending compatibility to data-independent acquisition (DIA) formats represents an important future direction. Implementing DIA-based workflows for MSP-MS could substantially shorten chromatographic gradients, reduce missing-at-random values, improve overall throughput, and decrease instrument time.
Conclusion
In summary, mspms streamlines MSP-MS data analysis, providing a reliable, reproducible, and adaptable platform for protease substrate profiling. The combination of its powerful analytical capabilities and intuitive design enables researchers to extract biologically meaningful insights from complex datasets with minimal technical barriers. Given its flexibility and broad applicability, mspms is positioned to become a standard tool in protease research, offering significant advancements in the study of proteolytic enzymes and their roles in health and disease.
Supplementary Information
Supplementary Material 1
Supplementary Material 2