Hybrid Deep Learning Model for EI-MS Spectra Prediction

Bartosz Majewski; Marta Łabuda

PMC · DOI:10.3390/ijms27031588·February 5, 2026

Hybrid Deep Learning Model for EI-MS Spectra Prediction

Bartosz Majewski, Marta Łabuda

PDF

Open Access

TL;DR

This paper introduces a hybrid deep learning model that predicts EI-MS spectra from molecular structures, improving library coverage and reducing the need for experimental data.

Contribution

A novel hybrid deep learning model combining graph and residual neural networks for EI-MS spectra prediction.

Findings

01

The model achieves a Recall@10 of ≈80.8% on the NIST14 database.

02

Generated spectra show high similarity to real ones, aiding library augmentation.

03

Remaining challenges include generalization and ensuring spectral uniqueness.

Abstract

Electron ionization (EI) mass spectrometry (MS) is a widely used technique for the compound identification and production of spectra. However, incomplete coverage of reference spectral libraries limits reliable analysis of newly characterized molecules. This study presents a hybrid deep learning model for predicting EI-MS spectra directly from molecular structure. The approach combines a graph neural network encoder with a residual neural network decoder, followed by refinement using cross-attention, bidirectional prediction, and probabilistic, chemistry-informed masks. Trained on the NIST14 EI-MS database (≤500 Da), the model achieves strong library matching performance (Recall@10 ≈ 80.8%) and high spectral similarity. The proposed hybrid GNN (Graph Neural Network)-ResNet (Residual Neural Network) model can generate high-quality synthetic EI-MS spectra to supplement existing libraries,…

Figures7

Click any figure to enlarge with its caption.

Funding1

—Department of Theoretical Physics and Quantum Information at Gdańsk University of Technology

Keywords

electron ionization mass spectrometryEI-MS spectrum predictiongraph neural networksdeep learningspectral library augmentationmass spectrometry databases

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMass Spectrometry Techniques and Applications · Metabolomics and Mass Spectrometry Studies · Computational Drug Discovery Methods

Full text

1. Introduction

Mass spectrometry is an important analytical tool for identifying chemical substances and providing indications about their structure and functional groups. A commonly used experimental technique is gas chromatography/electron ionization-mass spectrometry (GC/EI-MS). Particularly, in the EI-MS ionization technique, molecules are bombarded with high-energy electrons (typically 70 eV), causing them to lose an electron and form a radical cation M^+•^. The high energy also generates extensive fragmentation, producing a characteristic pattern of fragment ions. It is a highly reproducible and cost-effective version of MS, and widely used in many fields, such as analytical chemistry, pharmacokinetic studies [1], medicine [2], proteomics [3], as well as metabolomics [4]. With the constant increase in the instrumentation development, the popularity and broad applicability of GC/EI-MS have caused sustained growth in the size of spectral databases. Experimentally acquired spectra can be queried against such databases to identify compounds and infer molecular structural properties [5,6]. Despite continuous expansion, database coverage remains limited because many compounds have not yet had their spectra measured; the absence of a reference spectrum for a target compound can therefore lead to misidentification. This creates a need for commercial spectral databases to regularly release updates. Consequently, commercial spectral libraries require periodic updates, a process that depends on the time and resources available for acquiring new spectra. For example, one that is widely used is the National Institute of Standards and Technology (NIST) Mass Spectra Library [7], adding roughly only 40,000 spectra to its EI-MS reference collection every three years. However, these updates tend to favor molecules of broad interest, so newly synthesized or less-studied compounds are often absent or underrepresented. Augmenting existing reference libraries with synthetic spectra generated by computational models is a promising approach to mitigate these limitations.

EI-MS spectrometry remains one of the most widely used analytical techniques for structural characterization, yet accurately predicting EI mass spectra from first principles is challenging because the process involves ionization and multi-step fragmentation of the ions [8]. Over the years, multiple computational approaches have been developed to address different aspects of the problem. Reaction-network and kinetics-based approaches (e.g., Rice–Ramsperger–Kassel–Marcus (RRKM) theory) approximate fragment abundances by mapping stationary points on the potential-energy surface and evaluating microcanonical rate constants [9]. High-level ab initio wavefunction methods such as Equation-of-Motion Ionization Potential Coupled Cluster (EOM-IP-CCSD) provide accurate ionization energies and electronic-state characterization, forming the foundation for modeling the initial ionization event [10]. Fragmentation can be simulated either through ab initio or semi-empirical molecular dynamics such as Born–Oppenheimer molecular dynamics (BOMD) or semi-empirical tight binding methods (e.g., GFN-xTB), which dynamically explore bond dissociations and rearrangements under EI-like energy conditions [8]. Collectively, these approaches offer a spectrum ranging from high-accuracy to high-throughput methods, enabling the increasingly realistic and systematic prediction of EI mass spectra.

RRKM theories explicitly model the redistribution of energy over the internal degrees of freedom and allow the estimation of the rate constants for the ionization reaction [11,12,13,14]. To apply this theory, we assume an ensemble of molecules that have been excited to a state in which they carry a fixed amount of energy E, a portion of which is in rotational motion, while the rest is stored in vibrational modes. According to RRKM theory, the number of possible reaction pathways is determined by the density/sum of the vibrational states. As the number of vibrational modes increases, the number of possible states grows rapidly. This limits the application of such theories to small molecules, as the count of vibrational modes increases with the size of the molecule [9].

Another line of predicting mass spectral fragmentations is to directly predict where fragmentation may occur within the molecule. There are approaches which do not use statistical theory and instead perform quantum chemical calculations to map stationary points on the potential-energy surface (PES). By examining reaction energies and molecular structures, these methods allow for the qualitative prediction of the primary decomposition pathways for molecular ions. There have been approaches that declared that EI mass spectra of simple organic compounds could be predicted through calculations of bond orders and partitions of Hamiltonians as in the work of Mayer and co-workers [15,16]. Additional approaches rely on the Binary-Encounter Bethe (BEB) formalism to map the deposition of excess energy in a molecule and to compute the corresponding internal energy distribution of the molecular ions [17].

On the other hand, methods based on BOMD allow for modeling the time evolution of the fragmentation event. In this approach, an EI-induced fragmentation is simulated by propagating nuclear motion on an electronic potential surface that is recalculated at each timestep. In this framework, the fragmentation event is represented by preparing the neutral molecule with appropriate excess energy. Due to the presence of excess energy, the system naturally explores possible bond cleavages and rearrangements that are consistent with underlying quantum-mechanical forces. A relatively recent example of ab initio is the QCxMS, a quantum chemical (QC)-based program that enables users to calculate mass spectra using Born–Oppenheimer molecular dynamics. The related software is developed by the group of S. Grimme [8,18,19]. Such methods enable the statistical sampling of molecular-dynamics trajectories. In consequence, multiple trajectories yield a distribution of possible fragments. Aggregation and renormalization of the fragments across all trajectories produces a simulated EI-MS spectra. The main limitations of these methods arise from their high computational cost. Because BOMD evaluates energies and gradients at a chosen electronic-structure level (one of the most time-efficient approaches with acceptable accuracy is the Density Functional Theory (DFT); however, in principle it usually is a higher level method, such as the post–Hartree–Fock methods: MP2 or CCSD), it is computationally demanding, requiring many long trajectories for statistical convergence, and may still struggle to capture rare events or non-adiabatic effects relevant to certain EI processes. Even with access to high-performance computing units, simulations may require several hours to generate a single trajectory for a medium-sized molecule. Such time constraints prevent the rapid generation of large numbers of spectra. Moreover, although these methods can often identify potential fragmentation pathways, their overall accuracies are frequently insufficient for reliable compound identification [20].

Data-driven methods for MS have been explored for a long time. Interest in computational models for predicting and identifying compounds from their mass spectra dates back to the 1960s with the DENDRAL project, which began in 1965 and produced the first well-documented expert system for assisting organic chemists in structure elucidation from mass spectra [21,22]. Since then, many algorithms and machine learning models have been developed for a range of MS-related tasks, including spectrum prediction, peak annotation, and spectral library searching [23,24,25,26,27,28,29].

Several models have been specifically designed for the direct prediction of EI-MS spectra. Competitive fragmentation modeling (the CFM family) introduced a probabilistic, rule-informed fragmentation model and has been developed both for tandem MS (CFM-ID) and for EI spectra (CFM-EI) [30,31,32,33,34]. CFM models use a tree-based probabilistic generative architecture, where the fragmentation of a molecule is modeled as a fragmentation tree that recursively represents how a parent molecule breaks into smaller pieces, predicting both mass/charge values and intensities. Notable examples of neural approaches for EI-MS prediction include NEIMS (Neural Electron–Ionization Mass Spectrometry) and RASSP (Rapid Approximate Subset-based Spectra Prediction). NEIMS uses extended-connectivity fingerprints as input to a feed-forward network with a bidirectional prediction mode that is the most characteristic feature of the NEIMS’ architecture [35]. The bidirectional prediction combines forward and reverse predictions to better capture neutral fragment losses. RASSP combines substructure enumeration with graph-based deep learning [36]. The RASSP model architecture is much more sophisticated than NEIMS; model preparation employs extensive subformulae and a subset generation algorithm in order to gather additional training data. It is divided into two independent models, one based on chemical formulas and the second on subsets of molecular fragments. It uses a common graph neural network encoder which learns atom representations which are later used for predicting a probability distribution over atom subsets and chemical subformulae. The resulting spectra are a weighted sum of probability-based intensities of given atom subsets and subformulae.

Many molecule-related problems have been addressed using graph neural networks [37,38], for example to create molecular representations [39,40] instead of relying on handcrafted fingerprints such as Morgan’s fingerprints [41]. This strategy is implemented, for example, in RASSP, as well as in models that aim to improve the NEIMS architecture [42], and others [43,44,45,46]. It has been shown that GNN architectures are capable of learning molecular representations that are often competitive with handcrafted ones, while substantially reducing the dimensionality of learned representations compared to fingerprints such as Morgan’s [47,48,49,50,51]. Handcrafted fingerprints also face the issue of bit collision due to the limited number of available bits. Increasing fingerprint dimensionality can reduce collisions, but learned dense embeddings, such as those produced by GNNs or other machine learning models, overcome this problem by allowing each substructure, atom, or bond to contribute directly to a real-valued representation rather than being forced into a fixed bit slot [52]. Recent GNN-based studies in drug–biological system modeling and mechanistic prediction have demonstrated the effectiveness of tailored message-passing schemes and graph-level embeddings for capturing complex, structured biochemical interactions [53,54], providing a broader perspective for the design of our GNN encoder and its application to modeling EI-MS fragmentation behavior.

The currently available models struggle with rare or complex fragmentation patterns and are often biased due to the underrepresentation of certain chemical classes in widely used spectral libraries. Moreover, most models with non-trivial architectures are closed solutions, making direct extension difficult or inefficient. In this work, we present a model that exploits the molecular representation learning capabilities of GNNs and leverages a ResNet-based decoder to accurately map these learned embeddings to predicted the EI-MS spectra, capturing complex nonlinear relationships between molecular structure and fragment intensities. Furthermore, in contrast to previous studies, the proposed model incorporates a dedicated refinement stage that explicitly injects physical and chemical knowledge through mass- and neutral-fragment-guided masks. This refinement framework is inherently modular and can be extended with additional mask-based components to further enhance the fidelity of the final predicted spectra. The model additionally integrates a cross-attention mechanism and a bidirectional prediction scheme inspired by NEIMS [35]. The flexibility of the refinement stage enables the accommodation of edge and exotic cases, including unusual fragmentation pathways or rare chemical motifs, by incorporating supplementary mask-based knowledge. In addition, we adopt a modified Recall@k evaluation strategy based on spectral embeddings. Specifically, our approach leverages spectral projections of both true and predicted spectra, which are integrated into the model training process through the inclusion of the retrieval-based loss component in the objective function. This formulation promotes greater uniqueness in the generated spectra, leading to improved Recall@k performance. Unlike prior EI-MS prediction studies, we conduct systematic ablation studies across multiple random seeds to assess model robustness and quantify the contribution of individual architectural components. Moreover, we provide a more detailed characterization of model performance by analyzing results across different molecular-weight bins. Finally, we provide quantitative measures of chemical diversity within the employed datasets by assigning molecules to coarse molecular superclasses and chemical classes.

2. Results and Discussion

Training pipeline. To examine the performance of the model, we used the NIST14 EI-MS database. During training, the main library (mainlib) was used. It was ensured that the molecules present in the replicate library (replib) had been removed from the mainlib before the model was trained to prevent information leakage. With removal of replicate molecules and enforcing SMILES (Simplified Molecular Input Line Entry System) uniqueness, the main library dataset amounted to 201,187 molecule–spectra pairs. The dataset was divided into training and test sets with 80/20 ratio. The performance of the model was analyzed across ten different random seeds to quantify its statistical variability, with the following random generators set:

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Dataset splits were obtained using a separate seed value of 42, with the same split used in all ten experiments for which the performance was tested. However, during training, the shuffling of the training set was enabled. The above approach was also used to perform ablation studies of the main components of the model, to determine their importance in operation of the model. Ablation studies consisted of checking the following model variants:

disable_attention—baseline model with disabled cross-attention,
replace_resnet_with_linear—baseline model with ResNet replaced by a simple linear neural network,
fix_alpha_forward—baseline model with only forward prediction, reverse prediction disabled in bidirectional prediction mode,
replace_gnn_with_pool_mlp—baseline model with MPNN (Message Passing Neural Network) encoder replaced by simple aggregator and MLP (Multi-Layer Perceptron),
disable_mask—baseline model with learnable probabilistic mask disabled.

The average values of the chosen metrics, together with their standard deviations, for all ten seeds are given in Table 1.

A more detailed distribution of the most important metrics is presented in Figure 1. The results indicate that the baseline model achieves the best performance within the statistical variability. It means that all the components verified through the ablation studies contribute positively to the final architecture. Furthermore, the box plots in Figure 1 reveal that the bidirectional prediction module is the main source of instability, as removing the reverse-prediction pathway reduces the standard deviation of all evaluated metrics across random seeds. The probabilistic mask exhibits similar behavior, as only individual random seeds deviate substantially from the average metric values. However, it should be noted that the two modules do not influence the model in exactly the same way. This is proven by the apparent difference in similarity values presented in Figure 1c. The third ablated component that reduces the variance of the metrics is the MPNN encoder. However, in contrast to bidirectional prediction and the probabilistic mask, this reduction is attributable to a substantial degradation in model performance. Replacing the MPNN encoder with a simpler MLP restricts the model’s learning capacity, resulting in consistently poor metric values across all random seeds. While the MPNN encoder is the most important overall, cross-attention and the probabilistic mask also show high importance in achieving a high cosine similarity between predicted and measured spectra (Figure 1c). This is an interesting result since one might assume that a strong influence on cosine similarity values would directly affect the Recall@k values. Figure 1d shows that this is not the case, as ablating the cross-attention and probabilistic mask modules only marginally degrades the Recall@10 values. These results suggest that high cosine similarity is not a strong determinant of Recall@k performance and provide evidence for a partial decoupling between spectral reconstruction quality and retrieval performance. This indicates that these modules primarily improve fine-grained spectral fidelity rather than the discriminative structure of the embedding space used for library matching, and that the retrieval head is robust to spectral imperfections that strongly impact cosine similarity.

From the ten random seeds used to evaluate the baseline model, the seed yielding the highest Recall@10 value was selected for subsequent, detailed performance analysis. The training dynamics of the chosen seed are shown in Figure 2.

Library matching task. The next stage of evaluation involved assessing the spectra predicted by the trained model within a library matching framework. The protocol used to quantify the matching accuracy of the predictions followed the procedures described in [35,42].

The predicted spectra for the molecules present in the replicate library were used to create an augmented spectral library by adding them to the main library. Following the reduction of the replicate library to retain only unique SMILES, 20,265 samples remained, yielding an augmented library containing 221,452 entries. The ground truth spectra of replib molecules were used as queries. Each query was compared with the augmented library, and their pairwise similarity was computed. The highest k similarity values for $[eqn]$ were saved for each query molecule.

In contrast to previous studies, in our model, similarity was not computed directly between the raw spectra and their predicted counterparts. Instead, comparisons were performed in the learned embeddings using the outputs of the projection head. The similarity measure used to compute similarities between the embeddings was quantified using the standard dot-product. In this case, all embeddings were L2-normalized; therefore, the dot-product reduces to simple cosine similarity. Moreover, during the retrieval of the top-k values and the evaluation of Recall@k, a mass filter was used, following the schemes adopted in refs. [35,36]. In the calculation of Recall@k, the filter considers only molecules of mass within $[eqn]$ Da of the mass of the query molecule. The model achieved following Recall@k values for the library matching task: 80.8% Recall@10, 67% Recall@5, and 25% Recall@1.

The Recall@10 and Recall@5 values in the library matching task are close to the levels achieved by other EI-MS prediction models. Our Recall@10 and Recall@5 results differ from previous models by 5.9–12.7% from the declared values. However, the decline in Recall@1 is substantially more pronounced than in comparable models. This indicates, that the spectra generated by the proposed model may lack sufficient uniqueness and that its capacity for generalization may be limited -both factors contribute to the final result. Molecules in the replicate library exhibit limited structural similarity to those in the training set, a trend observed across multiple versions of the NIST EI-MS library [35,36,42]. We verified it for NIST14 by Tanimoto similarity calculations and found that only 23.6% of replib molecules have a counterpart in mainlib with a similarity of 0.8 or higher. The bad generalization assumption would mean that the model learned to predict spectra that capture structural dependencies of molecules with structures mostly similar to those present in the training set. This would partially explain why Recall@1 is worse in library matching with replicate library than during the training, while Recall@10 and Recall@5 are very similar. The model proposed in this work shares similar core components as the model presented in [42]: GNN encoder, ResNet decoder and bidirectional prediction. Our model additionally incorporates the cross-attention and probabilistic mask mechanisms, as well as a recall-oriented loss function. Consequently, the two models are expected to exhibit at least comparable performance. This expectation is largely confirmed by the key performance metrics, which differ only by a few percentage points. Our model achieves the Recall@10 and Recall@5 values lower by 5.9% and 12.1%, respectively. Moreover, it is important to emphasize that our model was trained on the NIST14 library, which exhibits substantially greater chemical diversity than the NIST05 library used by Zhang and co-workers [42]. This discrepancy is reflected, for instance, in the size of the atomic number one-hot encoding: the NIST14 database contains 17 additional atom types present in the molecular spectra. The higher molecular diversity in NIST14 increases the complexity of the learning procedure and makes achieving the same level of generalization more challenging.

Limited spectral uniqueness may also contribute to the reduced performance. This concern partially motivated the inclusion of a retrieval loss component to the total loss function and the incorporation of a projection head into the model. During the development of the model, we observed that training only against an MSE-type loss caused the model to predict spectra that exhibited high similarity scores with many candidate molecules simultaneously. Because the model cannot produce perfectly accurate spectra, those distractor spectra could have been chosen over the spectra for the specific molecule for which the prediction was made. This behavior suggests that either (i) the representations of molecules produced by the MPNN encoder were insufficiently separated in latent space, or (ii) the structural model components such as the learnable probability mask, intended to encode group-specific fragmentation characteristics, may have induced excessive similarity across the predicted spectra.

The incorporation of a cross-entropy-based loss function, a projection head, and expanded one-hot feature encodings was designed to maximize the separation of molecular embeddings within the representation space. This objective was successfully achieved since these modifications proved effective, yielding substantial improvements in recall performance. Specifically, the highest Recall@10 observed prior to introducing the retrieval loss was approximately 45%. It is also worth noting that applying a mass filter during the Recall@k evaluation produced further improvements, increasing Recall@k by 10–25% depending on the random seed and the value of k.

Average similarities. Additionally, we analyzed the distribution of cosine similarity values between the ground truth and predicted spectra for the replicate library, considering both the raw and embedded spectra (Figure 3). For raw spectra, the average cosine similarity was 0.74, with approximately 79% of values exceeding the 0.6 threshold, which is almost on par with the results achieved by [42]. For the embedded spectra, the average similarity was 0.69, with about 77% of the values above 0.6. From the cosine similarity value distributions, it is evident that although the average is high there are still many spectra with low similarity scores. This once again implies that the model is capable of creating spectra that on average agree with the training data but struggles with examples that deviate from said average. An interesting result is that despite the lower average similarity values of the embeddings, the scheme using them achieves drastically better recall values, which validates their use, as well as suggesting that higher separation of data points in the representation space at the expense of lower average similarities might be beneficial when it comes to Recall@k values in library matching tasks.

Models comparison. To further validate the performance of the proposed model, we compared it with two pretrained NEIMS and RASSP SubsetNet models, available on their respective public repositories. Firstly, it should be noted that SubsetNet cannot predict spectra for the entire replib because it is restricted to molecules composed only of the elements {H, C, O, N, F, S, P, Cl} and by limitation in subset enumeration [36]. As a result, SubsetNet is unable to generate predictions for the full replicate library, reducing the applicable subset to 17,357 molecules. We refer to this as “RASSP split replib” (RS replib).

To ensure a fair comparison, both the proposed model and NEIMS were evaluated on the full replib as well as on the RS replib, whereas SubsetNet was evaluated exclusively on the RS replib. For all models, the predicted spectra were generated using the corresponding pretrained networks and subsequently incorporated into the library-matching task following the same evaluation protocol as applied to the proposed model. Performance was quantified using Recall@k metrics, and average cosine similarity was additionally reported. To provide a more detailed analysis, Recall@k values were computed separately for three molecular-weight bins: small (1–175 Da), medium (176–324 Da), and large (325–500 Da). A summary of all the results is presented in Table 2, where the proposed approach is denoted as HYBRID.

SubsetNet achieves the highest overall performance among the evaluated methods, with the exception of the largest molecular-weight bin. The proposed HYBRID model consistently surpasses NEIMS on nearly all Recall@k metrics, the only deviation being Recall@1 for large molecules. It also exhibits a modestly lower average cosine similarity across the considered mass ranges. It is important to emphasize that SubsetNet is trained on a smaller and more chemically restricted dataset, excluding inorganic species and limiting overall chemical diversity. Moreover, SubsetNet employs a substantially larger architecture and relies on a computationally expensive bond-breaking-based subset enumeration procedure, which increases inference time and limits scalability for large libraries. In contrast, the proposed HYBRID model attains competitive retrieval performance while maintaining a more compact architecture, faster inference, and broader chemical coverage, including support for inorganic atoms. These characteristics render HYBRID a more versatile and scalable solution for large-scale EI-MS library augmentation, despite SubsetNet achieving higher performance within its narrower applicability domain.

It should also be noted that EI-MS prediction models are known to be sensitive to experimental setup and spectral comparison methodology [55]. The evaluation protocol employed in this work differs from that used in the original NEIMS study [35], which likely contributes to the observed reduction in NEIMS performance relative to previously reported values. Notably, a similar decrease in Recall@1 performance is observed for both NEIMS and HYBRID. While this may suggest an influence of the spectral projection-based retrieval scheme, this effect is not observed for SubsetNet, indicating that the reduced Recall@1 performance in NEIMS and HYBRID cannot be attributed solely to the retrieval formulation. Instead, it likely reflects broader factors, including increased chemical diversity and the inclusion of inorganic species. A comprehensive assessment of these effects would require retraining HYBRID and NEIMS on SubsetNet’s restricted dataset and recomputing all evaluation metrics; however, this dataset is not publicly available, and such an analysis is therefore beyond the scope of the present study.

Examples of the spectra predicted by HYBRID for different values of cosine similarities are depicted in Figure 4 for the chosen molecules with the different size and chemical properties. Acetylene $[eqn]$ 4-chloro-3-nitrobenzophenone $[eqn]$ ) and 1,2,3,4-tetraphenyl-1,4-butanedione $[eqn]$ ) were chosen to validate the efficiency of the hybrid model. In 2 out of 3 of the presented spectra examples the most intensive peak is placed correctly, with the third example having two of the top peaks switched. Overall peak placement is more accurate for heavier molecules; the intensities of the peaks differ from the measured mass spectra, but they mostly retain their proportions in terms of order from highest to lowest intensities.

Moreover, in Figure 5 we present the spectra predicted by the three compared models alongside the experimental spectrum for the molecule with the highest similarity value in Figure 4. The spectrum predicted by NEIMS immediately stands out: it contains many more visible peaks, and although the locations of the most intense peaks are mostly correct, their intensities deviate substantially from the experimental values. Further inspection of other NEIMS predictions reveals this to be a recurring pattern; its spectra are typically much busier than their experimental counterparts.

In contrast, the predictions of HYBRID and RASSP SubsetNet are much closer to the experimental spectrum. The positions of the main peaks are accurate, and their intensities are reproduced more faithfully than in NEIMS. The HYBRID model better captures the intensity of the secondary peak, although it slightly overestimates the tertiary peak. RASSP SubsetNet shows the opposite behavior: it estimates the tertiary peak well but underestimates the secondary one.

This comparison also partially illustrates why cosine similarity alone is not a sufficient metric. Both HYBRID and SubsetNet achieve a cosine similarity of 0.98 with the experimental spectrum, whereas NEIMS reaches only 0.75. One might expect a 75% agreement to produce a spectrum closer to the experimental one than that shown in Figure 5c. However, this example—near the average similarity of NEIMS—highlights the limitations of relying solely on cosine similarity.

Based on the comparative results, the HYBRID framework demonstrates the ability to partially alleviate the coverage limitations of existing reference mass spectral libraries.

3. Materials and Methods

We propose a hybrid deep learning model architecture that can be divided into three main compartments—encoder, decoder, and refinement—as presented in Figure 6. It shows a step-by-step procedure, starting from conversion of isomeric SMILES strings to graphs which are fed into the GNN encoder. The encoder learns effective molecule representations, which are later decoded by a residual neural network. The spectra proposed by ResNet are further refined by cross-attention, probabilistic mask and bidirectional prediction. The primary aim of the model is to provide accurate EI-MS predictions for molecules in the 1–500 Da range, that can be used to augment existing spectral libraries. Furthermore, the main advantage of the structure of this model is that its architecture was intentionally designed with the objective of supporting possible additional mask-like modules. Those structures aim to improve model’s performance within the chosen data distribution or generalization to out-of-distribution examples.

3.1. Feature Extraction

The sole input to the model is isomeric SMILES strings. Using the RDKit 2024.03.6 package [56], they are processed into graphs, from which simple atom and edge features are extracted using basic package functionalities. Created graphs do not contain nodes representing hydrogen atoms; we use the number of hydrogens attached to each node as an atom feature. The atom and edge features used are presented in Table 3. One-hot encodings are used to represent the features to separate the embeddings of molecules of similar structure as much as possible. The features used for the atoms and bonds are almost the same as in [42], with the difference that we used explicit valence as can be seen in Table 3.

3.2. Datasets

During the development of the model, the NIST14 spectral database was used. The NIST EI-MS spectral database contains two separate libraries: main and replicates library. From the EI-MS fraction we selected spectra of molecules with mass equal or less than 500 Da. The mass was limited to 500 Da because most of the molecules present in the library are lighter than that, making heavier molecules underrepresented, which could have significantly worsened the performance of the model, especially because accommodation of all molecules present in the dataset would have required expanding the dimension of predicted spectra from 500 to over 1600, making most entries even more sparse and causing signal dilution. Since the database does not contain SMILES strings, they were collected by querying the PubChem database [57] with InChIkeys present in the NIST14 EI database. Isomeric SMILES can be ambiguous, that is, the same SMILES may represent molecules with identical graphs but different exact 3D coordinates. The exact coordinates influence specific fragmentation patterns, and although NIST14 provides 3D coordinates, the geometry of each molecule during spectrum measurement may differ. This is why we did not use the 3D coordinates as features. While the geometries are optimized, NIST cautions against using them for critical applications. We consider representing geometries via SMILES sufficient and a more robust solution. To avoid duplication issues, only the first occurrence of each SMILES is retained in the final datasets.

We analyzed the chemical diversity of the libraries with a simple algorithm that leverages the RDKit substructure-matching capabilities [56]. A curated set of SMARTS (SMILES Arbitrary Target Specification) patterns was used to detect functional groups, and molecules were grouped according to a coarse-grained version of the classification schemes in [58,59]. Figure 7 shows the percentage of molecules assigned to each superclass and class for both mainlib and replib. Moreover, we computed the Tanimoto similarity of ECFP4 fingerprints between replib and mainlib. Only $[eqn]$ of replib molecules have a counterpart in mainlib with a similarity of 0.8 or higher.

3.3. Model’s Architecture

3.3.1. GNN Encoder

In the first stage, the model uses a graph neural network to learn molecular representations, encoding each molecule’s structure into a graph-level embedding. The graph encoder is based on the PyTorch Geometric 2.6.1 implementation of the MPNN [60]. Message passing in MPNN can be described as

[eqn]

where ⊕ denotes a differentiable, permutation invariant function; here, it is a sum. $[eqn]$ and $[eqn]$ denote differentiable functions; in this case they are MLPs.

3.3.2. ResNet Decoder

The ResNet decoder is based on a shallow residual network architecture. It takes the output of the graph encoder as input and produces initial, unrefined spectra. It is made of one ResNet block and three projection layers that scale up the block output to 512 dimensions, then reduce it to 256 and finally give an initial prediction of dimension 500. Upscaling increases the feature space so the network can represent more complex patterns. The downscaling that follows compresses the rich features into a more compact, informative representation. This helps the model focus on the most important signals and reduces overfitting. Between each linear layer, we used layer normalization. Hyperparameters of ResNet block and other parts of the model can be seen in Table 4. Spectra are predicted at integer resolution, where each index in the predicted output corresponds to an integer $[eqn]$ value, with the assumption that index 0 corresponds to m/z = 1.

3.3.3. Initial Prediction Refinement

Cross-attention. Refinement of the initially predicted spectra begins with a cross-attention module that dynamically reweights spectral regions according to the molecular context, thereby emphasizing chemically meaningful peaks. In this module, the graph-encoder output serves as the query, whereas the keys and values are derived from the initial spectral prediction. The dimensions of queries, keys, and values are matched by projecting all three to the same dimension using linear layers (this dimension is equal to the dimension of graph encoder output). This mechanism enables the model to adaptively highlight fragmentation patterns that are most consistent with the underlying molecular structure.

Bidirectional prediction. Following the cross-attention module, a modified bidirectional prediction scheme based on the one introduced by the NEIMS model [35]—is applied. In contrast to the original formulation, we omit the small mass-shift that allows fragment peaks to exceed the molecular mass due to isotopic contributions, thereby simplifying the prediction process. Bidirectional prediction is applied to the refined spectra, enabling the model to capture complementary fragmentation pathways: some molecules predominantly fragment in a forward direction, while others exhibit complex or reverse fragmentation behavior. By jointly learning both directions, the model can adaptively select the most appropriate fragmentation representation for each molecule.

Probabilistic mask. In the next step, the predicted spectra are processed by a probabilistic mask, composed of the following components: position-dependent probabilities, mass dependent scaling and neutral loss pattern.

Position-dependent probabilities learn which mass regions are generally more likely to contain peaks. It captures universal fragmentation patterns and provides foundation that works across different molecule types.

Mass-dependent scaling adjusts probabilities based on molecular weight. Depending on the size of molecules, it can make peaks more concentrated in lower m/z regions or have them spread across a wider range. This scaling considers the position of the peaks relative to the molecular mass rather than their absolute positions.

Neutral loss patterns account for commonly lost fragments during electron-induced fragmentation. For example, alcohols tend to lose $[eqn]$ , and amino acids often lose $[eqn]$ . We explicitly incorporate these frequent neutral losses as a list of fragments and enhance their peaks with Gaussian boosts, slightly increasing the intensity of neighboring peaks. This adjustment accounts for measurement uncertainty and isotope patterns, while also smoothing gradients during model training. Gaussian parameters are predicted from graph embeddings using two shallow MLPs: one for the amplitude and one for the width (standard-deviation-like). Amplitudes are constrained to the (0, 1) range via a sigmoid, while widths are processed through a softplus function with a + 1 shift to prevent excessively small values. Gaussians are centered at the index corresponding to each neutral loss. The resulting Gaussians are not normalized, as normalization would cause a highly uncertain (wide) loss to contribute the same total mass as a precise one.

Overall, the mask system encodes the principle that certain fragmentation patterns are more common across different molecule types, while still allowing for molecule-specific variations. It combines learned chemical knowledge with adaptive prediction capabilities.

3.3.4. Model Training

The target spectra used for model training are L1 normalized. This can be thought of as representing the spectra as discrete probability distributions. The loss function used for training is a combination of two components: reconstruction and retrieval loss. Reconstruction loss is inspired by the loss function used in [35]:

[eqn]

where $[eqn]$ are clamped predictions $[eqn]$ , with per sample predictions $[eqn]$ . Target spectra are denoted by $[eqn]$ , $[eqn]$ is the $[eqn]$ value found a the k-th index and B is the batch size. $[eqn]$ is a small positive constant used to prevent errors like division by zero or issues arising from calculating the square root of zero. Normalizing the reconstruction loss only with the true spectra ensures that the magnitude of the loss is scaled appropriately for the complexity of the ground truth spectrum, without allowing the predicted spectrum’s own magnitude to interfere with the learning direction. This makes it a much safer and more robust choice for training than a loss that normalizes by both the predicted and true signals.

Retrieval loss component is a cross-entropy type of loss that is calculated between true and predicted spectra embeddings. Those embeddings are obtained through a small projection head (small MLP), that reduces dimensionality of the spectra from 500 to 128, projection head embeddings are L2 normalized. The projection head is trained together with the prediction model; however, it is affected only by retrieval loss, meaning it is decoupled and independent from reconstruction loss, and thus L1 normalized targets and L2 normalized embeddings do not interfere with each other. Let projected L2-normalized embeddings for predictions and targets be

[eqn]

with $[eqn]$ . Compute the score matrix $[eqn]$ whose element $[eqn]$ is

[eqn]

where $[eqn]$ is a hyperparameter. Cross-entropy retrieval loss uses the diagonal as positive pairs (index i matched to i). Let the label for query i be $[eqn]$ . The per-batch retrieval loss is

[eqn]

The final loss function is a sum of reconstruction and retrieval loss

[eqn]

with $[eqn]$ being a hyperparameter. Additionally, as a sanity check during training, a simple cosine similarity loss is computed for the test set. For a single sample i,

[eqn]

The loss per sample is $[eqn]$ . Batch loss

[eqn]

A retrieval loss component is added to maximize recall values for library matching tasks. Moreover, it helps the model navigate angular dependencies better, stabilizing average cosine similarity values for the test set across different experiments. Reconstruction loss on its own has a tendency to overcompensate for differences in the direction of predicted vectors by adjusting their magnitudes, which can result in very low similarity values, even with good reconstruction loss values.

Model parameters are optimized using the Adam algorithm. We also employ a scheduler that reduces the optimizer learning rate on a loss plateau. The maximum number of training epochs is set to 500; however, an early stopping scheme is used, with patience parameter set to 50, which usually results in a number of training epochs no bigger than 350.

4. Conclusions

The proposed hybrid deep learning model successfully addresses the need to augment existing EI-MS spectral libraries by high-quality synthetic spectra. Trained on NIST14 (≤500 Da) and evaluated across ten random seeds, the model achieves strong library-retrieval performance (Recall@10 ≈ 80.8%) and an average raw cosine similarity of ≈0.74, indicating its capacity to generate spectra suitable for augmenting reference libraries. Moreover, ablation studies further confirm the positive influence of all major architecture components, although the bidirectional prediction module exhibited the highest variance.

However, the observed drop-off in Recall@1 (25%) suggests that model generalization remains biased toward molecules structurally similar to those in the training set, indicating an important direction for future improvement. We attribute the remaining errors to the dataset bias, limited spectral uniqueness, and inherent model’s approximations (e.g., integer m/z resolution and mass cutoff), suggesting that these are the main constraints to better identification performance.

Potential future work will focus on improving generalization and library matching tasks. Those could be achieved through a combination of data-centric, representation-level, and training-strategy enhancements. Increasing the data volume, for instance by incorporating newer versions of NIST spectral library, is likely to improve overall model performance, as neural network-based models are known to depend strongly on both the size and diversity of training data. Richer molecular representations could provide more expressive conditioning signals for spectrum prediction. Pretraining the encoders on large-scale self-supervised tasks may further improve robustness when fine-tuned on spectral data. During the training, more sophisticated regularization strategies such as peak-sensitive dropout could lead to higher embedding uniqueness. Finally, applying contrastive or metric learning to spectral embeddings could plausibly result in more effective grouping and the separation of molecular embeddings, thereby improving the model’s discrimination and generalization.

In practice, our approach provides a scalable complement and alternative to computationally expensive quantum-chemical simulations, partially alleviating gaps in library coverage. Future work will focus on further enhancing spectral uniqueness, increasing dataset diversity, and incorporating hybrid physics-informed features to improve model generalization, using the methods described above.

Bibliography60

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Hsieh Y. Korfmacher W.A. Increasing Speed and Throughput When Using HPLC-MS/MS Systems for Drug Metabolism and Pharmacokinetic Screening Curr. Drug Metab.2006747948910.2174/13892000677769796316787157 · doi ↗ · pubmed ↗
2Gingras A.C. Gstaiger M. Raught B. Aebersold R. Analysis of protein complexes using mass spectrometry Nat. Rev. Mol. Cell Biol.2007864565410.1038/nrm 220817593931 · doi ↗ · pubmed ↗
3Timp W. Timp G. Beyond mass spectrometry, the next step in proteomics Sci. Adv.20206 eaax 897810.1126/sciadv.aax 897831950079 PMC 6954058 · doi ↗ · pubmed ↗
4Lai Z. Tsugawa H. Wohlgemuth G. Mehta S. Mueller M. Zheng Y. Ogiwara A. Meissen J. Showalter M. Takeuchi K. Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics Nat. Methods 201815535610.1038/nmeth.451229176591 PMC 6358022 · doi ↗ · pubmed ↗
5Stein S.E. Scott D.R. Optimization and testing of mass spectral library search algorithms for compound identification J. Am. Soc. Mass Spectrom.1994585986610.1016/1044-0305(94)87009-824222034 · doi ↗ · pubmed ↗
6Stein S.E. Chemical substructure identification by mass spectral library searching J. Am. Soc. Mass Spectrom.1995664465510.1016/1044-0305(95)00291-K 24214391 · doi ↗ · pubmed ↗
7Linstrom P.J. Mallard W.G. NIST Chemistry Web Book, NIST Standard Reference Database Number 69National Institute of Standards and Technology Gaithersburg, MD, USA 2024 Available online: https://chemdata.nist.gov(accessed on 13 December 2025)
8Bauer C.A. Grimme S. How to Compute Electron Ionization Mass Spectra from First Principles J. Phys. Chem. A 20161203755376610.1021/acs.jpca.6b 0290727139033 · doi ↗ · pubmed ↗