UHRMS Formula Assignment: Diophantine-Based Recalibration Yields Lorentzian Mass Error Distribution as the Limiting Factor

Neda Safaridehkohneh; Albrecht Ott

PMC · DOI:10.1021/jasms.5c00226·December 26, 2025

UHRMS Formula Assignment: Diophantine-Based Recalibration Yields Lorentzian Mass Error Distribution as the Limiting Factor

Neda Safaridehkohneh, Albrecht Ott

PDF

Open Access

TL;DR

This paper introduces a new Diophantine method for assigning molecular formulas in ultrahigh-resolution mass spectrometry, improving accuracy by distinguishing systematic and random mass errors.

Contribution

A novel Diophantine-based recalibration method is introduced, achieving Lorentzian mass error distribution as a limiting factor for UHRMS.

Findings

01

The Diophantine method distinguishes systematic and random mass errors in UHRMS.

02

Correcting systematic errors results in a Lorentzian random mass error distribution.

03

The method achieves molecular formula assignments near the theoretical accuracy limit.

Abstract

Ultrahigh-resolution mass spectrometry (UHRMS) is a well-established analytical method for characterizing complex molecular mixtures. It is usually performed with Fourier transform techniques, based either on ion cyclotron resonance (FTICR-MS) or mass-dependent oscillations in an ion trap (FT-Orbitrap-MS). In spite of the high technical level of these instruments, often spectral interpretation remains difficult, in particular in a nontargeted approach of complex samples. Here, we introduce a Diophantine method for molecular formula assignment. Taking the ubiquitous Gaussian distribution as an example, we first show how knowledge about random mass error can be used to assign molecular formulas in a statistically consistent way. By considering all possible attributions within a large mass error range, we show how the systematic error stemming from suboptimal calibration can be…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals1

Diophantine

Figures12

Click any figure to enlarge with its caption.

Isotope filtering process. The x-axis represents the mass-to-charge ratio. The y-axis indicates the peak height of detected masses in the mass spectrum. 1.003355 corresponds to the mass difference between 12C and 13C. For a given mass, M 1, a paired isotopic compound must be expected at M 1 + 1.003355. With δM, the half-confidence limit of the measurement, all masses within M 1 + 1.003355 ± δM (blue rectangle) are considered to be possible isotopes of M 1.

Flowchart of the proposed algorithm: after parameter initialization, the m/z list is isotope-filtered into paired and unpaired subsets. Both undergo Diophantine assignment, generating candidate formulas filtered by six of the seven Golden Rules. Chemically plausible formulas are stored in the output.

Mass deviation (within ±1 ppm) of all candidate molecular formulas for the perfectly calibrated synthetic data set of Green and Perdue as generated by the Diophantine solutions up to m/z = 500. We applied the elemental restrictions as given by the data (CHONPS). We applied the seven golden rules. Black dots denote correct assignments from the target list; other dots are incorrect candidate assignments. Each line represents a compound family obtained by adding or subtracting a specific LMM from a baseline formula, producing a fixed exact mass deviation. Three such families are highlighted in red, e.g. LMM1 = C7H–2O–8NS with ΔM = 1.76 × 10–4 Da.

Percentage of correctly assigned compounds of the target list as a function of the random mass error imposed on the data. As the mass error increases, the ability to accurately identify compounds diminishes, resulting in fewer correct identifications, i.e., the number of red points in Figure decreases as the mass error increases. Dots represent data from simulations; line is a guide for the eye.

Percentage of correctly identified isotope pairs as a function of mass error, using the default parameters for isotope identification. As the mass error rises, the number of correctly identified isotopes decreases, subsequently affecting the assignments. Dots represent data points. The gray line represents an exponential function as fitted to the data.

Percentage of correctly assigned isotopes of the target list at 0.5 ppm mass error as a function of multiples of the default isotope search window (δM = 0.0001678 Da). The line is a guide for the eye. The star marks results using default parameters (four-decimal precision in KMD analysis); other points correspond to three-decimal precision in KMD. Beyond a search window four times the default, the gain is minor. We conclude that the isotope search window needs to be adjusted as a function of the mass error of the data.

Percentage of correctly assigned components as a function of input mass error. The teal dots represent the identifications using the default parameter set. The orange dots represent the upper limit of accuracy according to the cumulative distribution function of a Gaussian. Refining the isotope filtering function increases the number of correctly assigned components by about 12% percent (arrow and star).

Mass deviation of all possible assignments within ±1 ppm from a decalibrated data set, subject to a random mass error of 0.05 ppm. Black dots represent the decalibrated data set without random mass error. Red points correspond to correctly identified masses from the target mass list. The pattern is asymmetrically distributed around the baseline, closely following the decalibration function.

Mass deviation of all possible assignments within ± 1 ppm of deviation for a mass spectrum from a prebiotic molecular broth. The top panel represents the raw data of the measurement. The middle panel represents the pattern after “recalibration” based on the internal calibrants, highlighted as orange plus signs before, and black stars after recalibration. The bottom panel shows a pattern resulting from a second recalibration using identified molecules (orange plus signs) from the populated pattern at positions where no internal calibrants were present (see Figure 2a-SI for the corresponding pattern after a first recalibration). The mass deviation of the identified molecules after recalibration corresponds to zero ppm (not shown).

Autocorrelation function of mass deviations from the baseline: initial data (blue) and recalibrated data (orange). The reduced degree of autocorrelation in the recalibrated data confirms increased randomness of the appearing mass error. The lag corresponds to the number of successive data points taken as the “distance” that the autocorrelation coefficient is determined for.

Histogram of mass deviations of all possible assignments based on raw and recalibrated data (see Figure , top and bottom panels). The histogram is constructed using a bin size of 0.5 Da (300 bins over the 150–300 Da range). Compared to a Gaussian distribution, the Lorentzian provides a better fit to the recalibrated data. This is confirmed by the AIC values (insert).

Tables1

1. Constraints on Multiple-Element Numbers for Compounds up to 2000 Da as Adapted from Kind et al.

elements	constraints
N, O, P, S > 1	N < 10, O < 20, P < 4, S < 3
N, O, P > 3	N < 11, O < 22, P < 6
O, P, S > 1	O < 14, P < 3, S < 3
N, P, S > 1	N < 4, P < 3, S < 3
N, O, S > 6	N < 19, O < 14, S < 8

Funding2

—Deutsche Forschungsgemeinschaft10.13039/501100001659
—Universit?t des Saarlandes10.13039/501100005690

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMass Spectrometry Techniques and Applications · Analytical Chemistry and Chromatography · Metabolomics and Mass Spectrometry Studies

Full text

Introduction

1

Ultrahigh-resolution mass spectrometry (UHRMS), typically exceeding a resolving power of 100,000, is performed on Fourier transform ion cyclotron resonance (FTICR-MS) and Fourier transform Orbitrap (FT-Orbitrap) instruments. It is a powerful analytical tool, widely used to characterize complex chemical mixtures from various sources, including the environment, biological matter, synthetic compounds, pharmaceutical products, as well as the petroleum industry ?−? ? ? ? . UHRMS promises to identify, in a single measurement, a huge number of molecular compounds (up to a few ten thousands) within a complex sample. This type of information can be crucial, for instance, in nontargeted assessments of toxicity of intricate molecular mixtures. It can be of great help in prebiotic scenarios?. This is among many other settings that concern complex mixtures of molecules in industrial applications or scientific research.

A significant large-scale effort to evaluate the reproducibility of measurements and assignments by different groups with different instruments highlighted the critical need for reliable assignment methods ?−? ? ? ? . Interpreting complex spectra in a nontargeted approach remains challenging. Several advanced molecular formula assignment methods have been developed. These include licensed based methods (e.g., Composer from Sierra Analytics, PetroOrg?) as well as open-source database related methods (e.g., UltraMassExplorer (UME),? FuJHA,? Formularity?). However, completeness and correctness of the database are crucially important. Up to 30% of spectra in the NIST EI mass spectral database were claimed to be incorrectly assigned?. There are also open-source calculation methods (e.g., CHO-FIT ?−? ? ) that do not require a database. A strong limitation in the assignment process stems from the combinatorial explosion accompanying higher mass molecules made of several atomic species, making purely calculative approaches challenging. Graph theoretical - and Diophantine approaches have yielded elegant methods for improvement ?−? ? . The CHOFIT method ?,? used low mass moieties limited to CHO, leading to excellent results on small molecules.

Despite the high precision of UHRMS techniques and devices, incorrect calibration often remains a significant source of error that complicates accurate assignment. It is advised to use internal calibration standards to control deviations that may occur due to space charges in the cyclotron (or the trap) ?,? . Space-charge effects in FT-ICR cause frequency shifts that depend not only on total ion abundance but also on ion cloud interactions, which can be linked to local peak densities and amplitudes ?−? ? so that corrections can be performed?. In Orbitrap instruments, space-charge effects can be pronounced due to ion-cloud interactions and field nonidealities ?,? . Local ion–ion interactions can lead to peak coalescence so that closely spaced ion peaks merge into a single detected peak. This effect is particularly pronounced if the relative abundance ratios are high?, causing the weaker ion cloud (the weaker peak) to lose coherence and collapse into the stronger one. This has been experimentally observed on Orbitraps? and FT-ICR instruments ?,? . As a result, additional measurement strategies can be useful to improve accuracy beyond ppm error. Segment-based approaches like OCULAR hold great promise in complex mixtures, where they improve the number of detected substances as well as the accuracy of the determined m/z as compared to simple, one shot injection?.

Error and noise are an inherent parts of every data acquisition process. The finite residence time of ions in the trap represents a physical limitation causing an exponentially decaying signal in the time domain resulting, due to the Fourier transformation, in a Lorentzian peak shape in the frequency domain. If the signal is cut off before decay, a sinc-shaped peak will appear. The peak shape directly corresponds to the theoretical error distribution caused by the (unavoidable) limitation in data?. Moreover, weakly concentrated ions will exhibit more random peak shapes due to the limited statistics that come with a vanishing number of ions?. Other sources of mass inaccuracy, for example from detection electronics, may increase error, although in well-designed instruments, they should not contribute by much?. Mathur? showed that the quality of the preamplifier sets the mass error of the baseline, which limits the instrument in the detection of weakly concentrated ions. For Q-TOF mass spectrometry, different methods have been developed to model and denoise acquired data ?−? ? ? ;their systematic application in FT-based mass spectrometry remains limited. To our knowledge, for FT mass spectrometry, the impact of noise and error on the accuracy of assignments has not yet been fully worked out.

In this study, we use linear Diophantine equations for assigning chemical formulas in a combinatorial way to data from UHRMS. However, instead of iteratively assigning peaks and checking the plausibility of the quality of the assignments at the end by looking at the statistics of mass deviations, here we take the opposite route. We start by considering all possible assignments (correct or not) that fall within a suitable range of mass deviation. As their mass deviation is plotted as a function of m/z, due to the nature of the solutions to the Diophantine equations, a characteristic, symmetric pattern must appear. We first investigate up to which point an intrinsic, random mass error from the instrument limits the assignments in a perfectly calibrated setting. We then show that deviations due to imperfect calibration can be spotted and corrected for, since a correct calibration necessarily aligns with the symmetric Diophantine pattern. We apply our method to data from a prebiotic broth generated in a Miller-Urey type experiment and analyzed on a Bruker Solarix FTICR-MS (without apodization) using electrospray ionization (ESI). We recalibrate the measurement to a better degree than what is achieved by internal calibration to obtain a basically Lorentzian distribution of the mass error as the limiting factor for assignment. It is this error that needs to be taken as the intrinsic error for a statistically meaningful assignment.

Background: Diophantine Equations

2

Diophantine equations, a class of equations requiring integer solutions, have already been discussed in the context of mass spectrometry? . They provide a natural framework for molecular formula assignment, since each measured nominal mass (the integer of its exact mass) must equal the sum of integer multiples of the nominal masses of the constituent atomic species. Usually, the uncertainty of the instrument is such that the measurement of the nominal mass is not affected by error. Accordingly, the molecular formula of the detected nominal mass M must be among the solutions of the Diophantine equation:

[eqn]

where i indicates the different atomic species under consideration; n _ i _ is the integer count of species i; and M _ i _ is its nominal mass. This formulation follows from the discrete nature of atoms, which defines the problem as Diophantine. Unlike classical enumeration-based approaches, this formulation generates candidate formulas systematically through combinations of so-called “special” and “homogeneous solutions”.

A special solution of the Diophantine equation can easily be obtained as follows.

(i) carbon count: calculating the count of carbons through integer division of M by the mass of carbon:

[eqn]

(ii) hydrogen count: the number of hydrogen atoms can be obtained as the remainder of this division:

[eqn]

This pair, (n _ C _, n _ H _), serves only as a mathematical starting point. In some cases, for example M = 72, the special solution yields only (six) carbons and no hydrogens. Our algorithm adjusts such cases (e.g., by subtracting one carbon and adding 12 hydrogens) before proceeding.

Even for more than two atomic species, all solutions that yield the nominal mass M can be obtained from one special solution (any special solution will do) by adding all solutions of the homogeneous Diophantine equation:

[eqn]

where n _ i _ are integers that can be positive or negative, and M _ i _ is the nominal mass of the atomic species i. Note that the exact masses corresponding to the solutions of eq are nonzero, so that the elements of the set of Diophantine solutions (the set of possible assignments for the nominal mass M) differ in their exact masses. To give an example, possible solutions to the homogeneous equation include C 4 O –3 and C –4 H 2 O 2 N with exact masses of 1.5255 × 10^–2^ Da and 8.554 × 10^–6^ Da, respectively.

Solutions of the homogeneous equation form a vector space with integer coefficients, spanned by A – 1 base vectors where A is the number of atomic species i that come into play (see SI, “Diophantine Vector Space” for details). Each base vector v _ i _ has a nominal mass of zero, but an exact mass m _ v _ i _ _ ≠ 0.

For formula assignment, the exact mass of the special solution, $[eqn]$ , is compared to the measured mass M _ M _:

[eqn]

The goal is to find integer coefficients λ_ v _ i _ _ to match |ΔM| within the error δ that must be chosen as a function of the accuracy of the measurement:

[eqn]

All possibly assigned molecular formulas are given by the special solution plus homogeneous solutions from linear combinations of base vectors λ_ v _ v that satisfy the condition in eq.

Example

2.1

Consider a detected, protonated ion with m/z = 90.05495547 Da. The corresponding neutral mass is 89.047679 Da, with a nominal mass of 89 Da. A special solution is C_7_H_5_, with an exact mass of 89.039125 Da. The mass deviation from the neutral mass is ΔM = 89.047679 – 89.039125 = 0.008554 Da. Within one ppm tolerance, all solutions to eq in the range (8.9 × 10^–5^ ± 0.008554 Da) are considered. The best elemental match is C_–4_H_2_O_2_N (8.554 × 10^–3^ Da). Adding this to the special solution reveals C_3_H_7_O_2_N, corresponding to Alanine.

2.2 Significance of the parameter δ. Note that δ in eq sets a finite volume in vector space, in other words, a finite number of possible solutions/assignments (at least if we impose limits on the number of atomic species). We understand that this volume increases as a power law with the error of the measurement with an exponent that depends on the number of elements under consideration. However, the number of possible assignments increases exponentially (or geometrically) with the number of atomic species. If the instrument maintains a certain error in terms of ppms across the considered mass range, the absolute value of δ increases proportionally to the mass under consideration, that is, the number of possible assignments grows as a power law. Unfortunately, ICR and Orbitrap instruments often show an increasing relative error with mass ?,? . As a result, the number of possible assignments grows faster than polynomially with m/z and faster than exponentially with the number of atomic species. However, some instruments maintain a roughly constant relative error across the considered m/z range?. We understand that the accuracy represented by the parameter δ is crucial; however, a priori, the accuracy of the measurement is unknown since many parameters may contribute to the deviation of the measurement from the ideal case.

Computational Details

3

Optimized Implementation of the Diophantine

Algorithm

3.1

To speed up computation, we do not simply apply the condition in eq. Instead, we first compute a library of solutions to the homogeneous Diophantine equation (eq). The library is limited to positive exact masses, sorted in ascending order. The library consists of low-mass compounds containing carbon, hydrogen, oxygen, nitrogen, sulfur, and phosphorus, limited as follows: |C|≤ 40, |H|≤ 140, |O|≤ 20, |N|≤ 12, |P|≤ 8, |S|≤ 8. The limits on each element were chosen based on known sample composition, and a trial-and-error process: we adjusted the limits to ensure that every plausible compound could be generated in the library while keeping the computational load manageable. These constraints are in line with the “Seven Golden Rules”?. We refer to the elements of the library as low-mass moieties (LMMs) that occupy the negative van Krevelen space.?

With these limitations on LMMs, we can find an assignment for masses up to 1500 Da within the defined accuracy. Here, we only consider masses up to 500 Da for synthetic data and up to 300 Da for experimental data. Therefore, the limitations imposed on LMMs should not matter for the cases considered. To generate all exact masses within the tolerance δ of the measured mass following eq, we extract the corresponding subset of LMMs from this library.

Isotope Pair Detection and Filtering

3.2

Accurate recognition of isotopes is essential for molecular assignments. Here we present a method limited to carbon that is valid for small molecules. Other isotopes have a low probability of appearing in our case so that we can safely neglect them. For cases with a larger probability of more complex isotopologues, more sophisticated algorithms such as IsoSpec need to be applied in spite of the increased computational expense ?,? . Electrospray is used for ionization in our instrument. It is known to produce almost only single charged molecules for small masses? so that we can safely neglect multiply charged molecules.

The isotope filtering process detects paired isotopes in the input mass list. It involves three steps:

(i) Setting the search interval δM: for a given mass M 1 from the input list, masses within the range of M 1 + 1.003355 ± δM are selected, as shown in Figure. δM represents the half-confidence limit based on the expected accuracy of the measurement. By default, we use δM = 0.0001678 Da. The value of 1.003355 corresponds to the mass difference between ^12^ C and ^13^C.

Isotope filtering process. The x-axis represents the mass-to-charge ratio. The y-axis indicates the peak height of detected masses in the mass spectrum. 1.003355 corresponds to the mass difference between 12C and 13C. For a given mass, M 1, a paired isotopic compound must be expected at M 1 + 1.003355. With δM, the half-confidence limit of the measurement, all masses within M 1 + 1.003355 ± δM (blue rectangle) are considered to be possible isotopes of M 1.

(ii) Kendrick Mass Defect (KMD) Analysis: Here, KMD refers to a generalized transform used for isotope detection rather than traditional CH_2_ homologous series analysis. The base unit is set to the ^13^C – ^12^C mass difference (Δm = 1.003355 Da), so that isotopologue series align. This isotopic KMD is applied to flag possible poly isotopic peaks prior to further checks and formula assignment. This saves computational time. As an example, the KMD of a mass M 1 = 235.18064 Da is calculated as follows:

[eqn]

where ⌊.⌋ represents the rounding down function. Two masses are considered a match if their KMD values are identical to the fourth digit, corresponding to a tolerance of approximately 0.067 ppm (see SI, “Isotope Filtering Tolerance Calculation”, for details).

(iii) Relative Intensity Analysis: This is a postfiltering analysis that is performed after assignment. The peak heights of the isotope candidates are checked against the expected abundance ratios. Here, we allow for a deviation of 10*%* in the logarithm of relative peak heights. This deviation must be set depending on the response function of the instrument.

A polyisotopic pair is confirmed if it meets all the three conditions above. See SI under “Isotope Filtering and Relative Intensity Analysis” for details on checks of the relative heights of isotopic peaks, including an exemplary calculation.

Selecting Chemically Plausible Formulas Using

the Seven Golden Rules

3.3

Compared to a purely combinatorial approach, many chemically nonmeaningful molecular formulas can be excluded by applying six of the seven Golden Rules?. The seventh rule is specific to small molecules in clinical and metabolomics studies using GC/MS, beyond the scope of this work, so, we dropped it.

The rules include: (i) restriction of element counts, (ii) compliance with LEWIS and SENIOR rules, (iii) verification of polyisotopic patterns, (iv) assessment of hydrogen-to-carbon ratios, (v) evaluation of nitrogen, oxygen, phosphorus, and sulfur to hydrogen ratios, and (vi) checks for high-probability, multiple element-to-carbon ratios.

Moreover, elemental ratios are limited as follows: H ≥ 2, $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ . Table summarizes the numeric constraints on the presence of multiple elements within a molecule.

1: Constraints on Multiple-Element Numbers for Compounds up to 2000 Da as Adapted from Kind et al.

Algorithm and Computational Details

3.4

Figure illustrates the algorithm. It starts by setting key parameters, such as the mass interval for isotope filtering, peak height deviation tolerance for isotopic pair verification, and mass deviation tolerance for assignments. These parameters can be modified by the user as a function of requirements. It continues by detecting polyisotopic pairs. The input data in the form of a list is divided in two: paired and unpaired masses. These two sublists are then processed separately using the Diophantine algorithm for molecular formula determination. Candidate formulas are checked against six of the seven Golden Rules. Formulas meeting these criteria are stored in the output list for further analysis. This implies that, for a single input mass, most of the time multiple possible molecular formulas are suggested.

Flowchart of the proposed algorithm: after parameter initialization, the m/z list is isotope-filtered into paired and unpaired subsets. Both undergo Diophantine assignment, generating candidate formulas filtered by six of the seven Golden Rules. Chemically plausible formulas are stored in the output.

Outline of the Calibration and Assignment

Process

3.5

Classical molecular formula assignment algorithms often rely on some form of recalibration, either by identifying internal calibrants or using an algorithm, for example the CHOFIT method, which tries to find the molecules made only from CHO, in the hope of attributing this limited amount of molecules correctly?. This is followed by brute-force enumeration of all possible combinations of elements up to certain limits, followed by heuristic filters such as the “Seven Golden Rules”?. This produces a figure of merit, where the mass deviation of the attributions is compared to a Gaussian error distribution with an error usually below one ppm. In contrast, our approach is explicitly rooted in the mathematical structure of Diophantine equations. For calibration, we know that we eventually need to generate a symmetric pattern if we plot the mass deviation errors of all possible assignments (including candidate assignments beyond the correct ones) as a function of m/z. The symmetry is because of the homogeneous solutions that can be added or subtracted from the correct assignment represented by the axis of symmetry. The pattern emerges from the discreteness of the Diophantine solutions. We then extract an error from the calibrated data. This error should be fed into the assignment process as detailed in the first part of this paper to perform a consistent assignment. With that, the fraction of false attributions is known. Our method is simple and fast compared to any other methods that we have found. It yields results close to what is technically achievable, given the data as delivered by the instrument.

Methods

4

Hardware

4.1

We used a personal computer with a Linux (Ubuntu) operating system, featuring a 12th generation Intel Core i7 processor capable of operating in 32 GB and 64 GB modes, running at 2.0 GHz. The algorithm was programmed in house using Python (3.10.12). Each run of assignments for a mass list between 150 and 500 Da typically required two to 3 h of time, not including the setup of the library. A data set with imposed random Gaussian mass error required more computational time due to increased complexity.

Data Sets

4.2

Synthetical Data Set from Green and Perdue

4.2.1

The mass list from Green and Perdue? comprises combinations of isotopes of carbon, hydrogen, oxygen, nitrogen, sulfur, and phosphorus, spanning masses from 150 to 1500 Da. For practical purposes, we restricted the mass range to 150–500 Da.

Adding Mass Error and Decalibration

4.2.2

We added Gaussian mass error to the synthetical data set. The Gaussian mass error distribution is characterized by zero mean and a defined standard deviation, quantifying the mass error level. For a given exact mass, M, Gaussian mass error produced a statistical error ΔM resulting in a new mass value M′ = M + ΔM. ΔM can be positive or negative. The standard deviation was set to an average value expressed as parts per million (ppm). Note that since the standard deviation was fixed in absolute terms, the error expressed in ppm was higher for small and smaller for large m/z. Setting the error in absolute or in relative terms is not expected to affect our conclusions. In practice, the error and its dependence on m/z is a function of the instrument and needs to be determined from the data (cf. Figure for an example).

We decalibrated the data set with added Gaussian mass error using a quadratic polynomial, inspired by the classical quadratic calibration equation of Ledford and Gross?.

Experimental Data Set

4.2.3

We performed mass spectrometry on a prebiotic broth, prepared as described in ?,? . The sample volume, a few ml, was filtered with 2 μm pore size before measurement. We performed direct injection using ESI on a Bruker Solarix FTICR instrument equipped with a 7 T superconducting magnet. The transient duration was 0.4194 s, with an ion accumulation time of 0.025 s, and an average of 256 scans. These acquisition parameters were optimized for the low mass range (<300 Da) setting. We used the Bruker software (Bruker Compass Data Analysis 5.0 SRI x64) for Fourier transform and peak picking. The data was then exported to a computer for the application of our method. Other instrument settings are given in the Supporting Information Table 1.

Results and Discussion

5

Unless stated otherwise, we applied the standard isotope filtering parameters as given in the methods section.

Assignments in the Absence of Mass Error for

a Perfectly Calibrated Data Set

5.1

We begin by analyzing the perfectly calibrated synthetic data set from Green and Perdue?. Figure illustrates the mass deviation of all possible assignments, up to the limit of one ppm, as a function of m/z. Correct assignments are limited to the baseline (at zero ppm). A distinct, regular, almost symmetric line pattern is observed, asymptotically converging toward the baseline. Each line corresponds to a family of compounds sharing a specific LMM in the Diophantine solution leading to a fixed, exact mass deviation with respect to the baseline. Based on the special and the homogeneous Diophantine equations, LMMs can be added or subtracted, resulting in molecular formulas that appear symmetrically positioned with respect to the baseline. A small degree of asymmetry is due to constraints from the seven golden rules.

Mass deviation (within ±1 ppm) of all candidate molecular formulas for the perfectly calibrated synthetic data set of Green and Perdue as generated by the Diophantine solutions up to m/z = 500. We applied the elemental restrictions as given by the data (CHONPS). We applied the seven golden rules. Black dots denote correct assignments from the target list; other dots are incorrect candidate assignments. Each line represents a compound family obtained by adding or subtracting a specific LMM from a baseline formula, producing a fixed exact mass deviation. Three such families are highlighted in red, e.g. LMM1 = C7H–2O–8NS with ΔM = 1.76 × 10–4 Da.

Assignments in the Presence of Gaussian Mass

Error

5.2

Figure shows possible assignments generated by our algorithm for data with imposed Gaussian mass error with standard deviations of 0.05, and 0.09 ppm. In the graph, increasing the mass error reduces the resolution of the initial pattern, however, up to this level of mass inaccuracy, the pattern does not vanish.

Mass deviation (within ±1 ppm) of all candidate molecular formulas for the perfectly calibrated synthetic data set of Green and Perdue as generated by the Diophantine algorithm up to m/z = 500. Top panel: perfectly calibrated data set without error (Figure ). Middle and bottom panels: the same data set with imposed random Gaussian mass error of 0.05 and 0.09 ppm. Correct assignments from target mass list as performed by our algorithm are in red; alternative candidate formulas in blue. A higher level of mass error reduces the sharpness of the pattern but maintains the symmetry relative to the baseline.

We observe that the presence of mass error increases the total number of assignments (the number of blue points in Figure) from 13,280 in the ideal (noise free and perfectly calibrated data set to 13,630 at 0.09 ppm mass error. To assess the quality of the assignment, we determine the percentage of correctly identified formulas from the target mass list (150–500 Da) as a function of mass error (Figure). At 0.86 ppm mass error (a very high mass inaccuracy for UHRMS), approximately 66% of compounds, a substantial portion  1172 out of 1789  were still correctly identified.

Percentage of correctly assigned compounds of the target list as a function of the random mass error imposed on the data. As the mass error increases, the ability to accurately identify compounds diminishes, resulting in fewer correct identifications, i.e., the number of red points in Figure decreases as the mass error increases. Dots represent data from simulations; line is a guide for the eye.

Approximately 34% of the target mass list (608 out of 1789) consists of polyisotopic pairs. As long as mass error is not considered in isotope filtering, these compounds remain poorly identified (Figure). At 0.86 ppm mass error, 90% of polyisotopic compounds are lost. However, since missed polyisotopic pairs are reassigned to molecules with ^12^C, only half of these molecules are lost for correct assignment. This results in an effective 15% (half of 34% × 90%) decrease in correct assignment.

Percentage of correctly identified isotope pairs as a function of mass error, using the default parameters for isotope identification. As the mass error rises, the number of correctly identified isotopes decreases, subsequently affecting the assignments. Dots represent data points. The gray line represents an exponential function as fitted to the data.

The identification of isotopes for the synthetic data set is a two-step process, allowing for two potential improvements in the presence of mass error: (i) broadening the confidence limit for polyisotopic pairs by increasing ΔM (Figure) and (ii) relaxing the precision of the Kendrick mass defect (KMD) considering identical masses up to three decimal places instead of four.

Percentage of correctly assigned isotopes of the target list at 0.5 ppm mass error as a function of multiples of the default isotope search window (δM = 0.0001678 Da). The line is a guide for the eye. The star marks results using default parameters (four-decimal precision in KMD analysis); other points correspond to three-decimal precision in KMD. Beyond a search window four times the default, the gain is minor. We conclude that the isotope search window needs to be adjusted as a function of the mass error of the data.

In the following, as an example, we set the mass error of the data set to 0.5 ppm. We see (Figure) that the default parameters correctly identified 82% of the compounds (1458 out of 1789). Relaxing the KMD precision to three significant digits improved this to 1524, an increase of about 5%. Expanding the search interval by a factor of 4 while maintaining three-decimal KMD precision increased the correct identifications to 1633, resulting in a 12% improvement. Increasing the confidence limit beyond this threshold did not yield significant gains, but extended the computational time.

Based on the cumulative distribution function of a Gaussian, approximately 5% of assignments fall outside 2σ and remain unidentified. Since we limit ourselves to a deviation of 1 ppm, the theoretical maximum for correct assignments at a mass error of 0.5 ppm, is 95% (1699 compounds). With 1633 identified compounds (91.5%), we approach this limit rather well. Figure illustrates the limit across different levels of mass error and the fraction of correct assignments (using the default parameter set). The relationship between the two is nontrivial with the fraction of correct assignments varying with mass error levels. The fraction of correct assignments remains bounded, however, by the cumulative Gaussian distribution if we limit ourselves to assignments within ±1 ppm.

Percentage of correctly assigned components as a function of input mass error. The teal dots represent the identifications using the default parameter set. The orange dots represent the upper limit of accuracy according to the cumulative distribution function of a Gaussian. Refining the isotope filtering function increases the number of correctly assigned components by about 12% percent (arrow and star).

Detecting Ill-Calibrated Data

5.2.1

We test our method on the synthetic data set with imposed random mass error that we decalibrated (see methods for details). The assignment process used the optimized parameters for isotope detection; i.e., a 4ΔM isotope search interval and the (relaxed) Kendrick Mass Defect (KMD) precision to the third significant digit. Results are shown in Figure. The decalibration function is directly visible from the shift of the entire pattern.

Mass deviation of all possible assignments within ±1 ppm from a decalibrated data set, subject to a random mass error of 0.05 ppm. Black dots represent the decalibrated data set without random mass error. Red points correspond to correctly identified masses from the target mass list. The pattern is asymmetrically distributed around the baseline, closely following the decalibration function.

Application to Experimental Data

5.3

We produced a prebiotic soup made from CHON (for details on production, see ?,? ). We added internal calibrants (listed in Table 3-SI). The instrument, a 7 T Bruker Solarix FTICR, was calibrated before measurement according to specifications from the manufacturer. We performed direct injection using ESI (see Methods for further details). Figure (top) shows the mass deviations of possible CHON assignments from the measured mass values. The emerging pattern, however, does not align with the baseline, suggesting the presence of systematic error.

Mass deviation of all possible assignments within ± 1 ppm of deviation for a mass spectrum from a prebiotic molecular broth. The top panel represents the raw data of the measurement. The middle panel represents the pattern after “recalibration” based on the internal calibrants, highlighted as orange plus signs before, and black stars after recalibration. The bottom panel shows a pattern resulting from a second recalibration using identified molecules (orange plus signs) from the populated pattern at positions where no internal calibrants were present (see Figure 2a-SI for the corresponding pattern after a first recalibration). The mass deviation of the identified molecules after recalibration corresponds to zero ppm (not shown).

To address this, we identified the internal calibrants (see SI Table 3) by matching their known chemical formulas to assigned entries in the data set and extracting their mass deviations for recalibration. These calibrants are highlighted as orange plus signs in Figure (top). They align along the most densely populated line of the plot. We then recalibrated the data by dividing the data into segments that contain three calibrants each and fitting a quadratic function accross the calibrants. This approach is inspired by the classical quadratic calibration equation of Ledford and Gross? while the segment-wise implementation is analogous to the “walking calibration” strategy described by Savory?. We detailed the process in the SI, Sec. 0.5. As shown in Figure (middle), recalibration with the internal calibrants aligns the pattern along the baseline.

Building on this, we performed an additional recalibration step using six assigned compounds. We combined them with the previously used internal calibrants (sets 2 in SI Sec. 0.5). This further aligned the most populated line to the baseline and improved the symmetry of the pattern as shown in Figure, bottom. We give detailed information on the calibration process in the Supporting Information.

We quantified the degree of randomness in the deviation of the assignments along the baseline using the autocorrelation function (ACF), defined as

[eqn]

where r _ i _ represents the mass deviation from the baseline of the i ^th^ assigned mass and k denotes the lag. Figure shows the results. The lower degree of autocorrelation in the recalibrated data confirms a more random pattern, indicating that systematic deviations were reduced.

Autocorrelation function of mass deviations from the baseline: initial data (blue) and recalibrated data (orange). The reduced degree of autocorrelation in the recalibrated data confirms increased randomness of the appearing mass error. The lag corresponds to the number of successive data points taken as the “distance” that the autocorrelation coefficient is determined for.

In FT mass spectrometry, m/z values are determined by Fourier transform of the transient. If the transient is cut off early, this leads to a sinc shaped peak in the frequency domain. For transients that decay faster, however, the peak shape is Lorentzian (due to the exponential decay of the signal at the corresponding frequency). The peak shape corresponds to the uncertainty resulting from the limited data in the time domain. The m/z values are obtained by centroiding these peaks. Accordingly, the horizontal (m/z) uncertainties must follow the same Lorentzian (or sinc) distribution, other errors neglected. If, like in our case, the observation time of the transients is relatively long, the distribution of the m/z errors is expected to retain a Lorentzian profile – as opposed to a Gaussian. This is because the central limit theorem does not apply to Lorentzians. Note that others obtain similarly shaped distributions after recalibration?. However, it is clear that if the Lorentzian shape is lost, either due to very low amounts of ions or elevated amounts leading to a sinc function, the average resulting error should appear closer to a Gaussian.

To determine whether a Gaussian or Lorentzian function better fits the mass deviations after recalibration, we applied the Akaike Information Criterion (AIC) test (see SI Sec. 0.5 for details). Only after recalibration did the assignments clearly favor the Lorentzian distribution, consistent with visual inspection (Figure).

Histogram of mass deviations of all possible assignments based on raw and recalibrated data (see Figure , top and bottom panels). The histogram is constructed using a bin size of 0.5 Da (300 bins over the 150–300 Da range). Compared to a Gaussian distribution, the Lorentzian provides a better fit to the recalibrated data. This is confirmed by the AIC values (insert).

Note that the half width at half-maximum (HWHM) of the Lorentzian fit to the mass deviation distribution is beyond 0.2 ppm (See Figure). The calibration error appears negligible compared to this mass error. This becomes clear already in Figure, bottom, where the deviation of the most populous line from the baseline is smaller than the deviations cause by the mass error. In?, the authors achieved even slightly narrower error at about double m/z, however, this was at the expense of a much more sophisticated algorithm and data calibration correction and for a different sample. It will be interesting to see in future work, if our approach can be improved to reach similar statistics.

In our analysis of the synthetic data above, we focused on a Gaussian mass error distribution, which is not what we find in Figure. However, since the distribution function of the Lorentzian is limited by the noise floor, in practice, this function is finite. Accordingly we do not expect any principal differences in transferring our results to that case.

Given the results above, an automatic recalibration strategy could consist of identifying the most populous regions in the mass deviation plots by an edge detection algorithm (see SI for examples), and using the edges for recalibration. However, we reserve a detailed study about optimization and automation for future work. Many details, beyond the scope of the here presented work, remain to be investigated. Alternative strategies such cross-correlation to establish the link between the theoretical and the experimentally observed pattern  may enhance assignment precision.

Moreover, one can feed our algorithm with improved data along already established directions. Our approach could be performed in Fourier space, where a gain in resolution can be achieved, in principle up to a factor of 2?. A further aspect to consider is the influence of concentration as reflected by peak intensity. Highly concentrated molecules generate much longer transients, leading to more intense and narrow Lorentzians. This means that, in principle, these should have a stronger weight in the calibration and attribution process. However, at the same time they tend to deviate more in m/z due to their increased space charge in the synchrotron. However, this dependence can be quantified and corrected for?. In the past, improved experimental approaches have been proposed. One of them is the OCULAR technique?, where scanning the entire mass range as small sections defined by a Quadrupole could greatly increase the precision of the m/z measurement while substantially increasing the number of detected substances as part of a highly complex mixture. The latter would improve our method, since more data is available for recalibration. At the same time it is a segment-wise measurement, which fits well to our segment-wise recalibration strategy. Moreover, better statistics on chemically meaningful molecular formulas that go beyond the seven golden rules could improve the assignments further.

Conclusions

6

Here we presented a novel method for assigning chemical formulas to mass spectrometry data in a combinatorial approach. Using linear Diophantine equations and corresponding low-mass moieties (LMMs), we looked at the ensemble of possible attributions of chemically plausible combinations within a given accuracy. Plotting mass deviations as a function of m/z, this generated a symmetric line pattern in a perfectly calibrated set. The symmetry stems from the low mass moieties that appear as general solutions to the corresponding homogeneous Diophantine equations. The symmetry was used for recalibrating the data. Recalibration yielded a Lorentzian mass error spectrum as expected from the physical limitations of the instrument. This spectrum quantifies the level of random mass error, an information that is crucial to achieve predictable likelihoods of correct assignments in a consistent approach. Note that at the same time our method delivers an excellent tool to optimize the settings of an instrument, since the mass inaccuracy becomes easily quantifiable.

Supplementary Material

Bibliography51

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Cooper W. T.Chanton J. C.D’Andrilli J.Hodgkins S. B.Podgorski D. C.Stenson A. C.Tfaily M. M.Wilson R. M.A history of molecular level analysis of natural organic matter by FTICR mass spectrometry and the paradigm shift in organic geochemistry Mass Spectrom. Rev.20224121523910.1002/mas.2166333368436 · doi ↗ · pubmed ↗
2Qi Y.Fu P.Volmer D. A.Analysis of natural organic matter via fourier transform ion cyclotron resonance mass spectrometry: an overview of recent non-petroleum applications Mass Spectrom. Rev.20224164766110.1002/mas.2163432412674 · doi ↗ · pubmed ↗
3Schum S. K.Zhang B.Džepina K.Fialho P.Mazzoleni C.Mazzoleni L. R.Molecular and physical characteristics of aerosol at a remote free troposphere site: implications for atmospheric aging Atmospheric Chemistry and Physics 201818140171403610.5194/acp-18-14017-2018 · doi ↗
4Mazzoleni L. R.Ehrmann B. M.Shen X.Marshall A. G.Collett J. L.Jr Water-soluble atmospheric organic matter in fog: exact masses and chemical formula identification by ultrahigh-resolution Fourier transform ion cyclotron resonance mass spectrometry Environ. Sci. Technol.2010443690369710.1021/es 903409 k 20397689 · doi ↗ · pubmed ↗
5Bianco A.Sordello F.Ehn M.Vione D.Passananti M.Degradation of nanoplastics in the environment: Reactivity and impact on atmospheric and surface waters Sci. Total Environ.202074214041310.1016/j.scitotenv.2020.14041332623157 · doi ↗ · pubmed ↗
6Wollrab E.Scherer S.Aubriet F.CarréV.Carlomagno T.Codutti L.Ott A.Chemical analysis of a “Miller-type” complex prebiotic broth: Part I: Chemical diversity, oxygen and nitrogen based polymers Origins of Life and Evolution of Biospheres 20164614916910.1007/s 11084-015-9468-826508401 · doi ↗ · pubmed ↗
7Collins B. C.Hunter C. L.Liu Y.Schilling B.Rosenberger G.Bader S. L.Chan D. W.Gibson B. W.Gingras A.-C.Held J. M.Multi-laboratory assessment of reproducibility, qualitative and quantitative performance of SWATH-mass spectrometry Nat. Commun.2017829110.1038/s 41467-017-00249-528827567 PMC 5566333 · doi ↗ · pubmed ↗
8Poulos R. C.Hains P. G.Shah R.Lucas N.Xavier D.Manda S. S.Anees A.Koh J. M.Mahboob S.Wittman M.Strategies to enable large-scale proteomics for reproducible research Nat. Commun.202011379310.1038/s 41467-020-17641-332732981 PMC 7393074 · doi ↗ · pubmed ↗