# Comprehensive Curation and Harmonization of Small-Molecule MS/MS Libraries in Spectraverse

**Authors:** Vishu Gupta, Hantao Qiang, Hsin-Hsiang Chung, Ehud Herbst, Michael A. Skinnider

PMC · DOI: 10.1021/acs.analchem.5c06256 · 2026-01-26

## TL;DR

Spectraverse is a comprehensive, extensively curated library of public tandem mass spectra (MS/MS) of small molecules, built to support metabolite identification and the training of machine-learning models in metabolomics.

## Contribution

Spectraverse is a harmonized, curated MS/MS library that addresses the quality, metadata, and annotation problems of public spectral libraries, which are scattered across disparate databases.

## Key findings

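- Spectraverse assembles reference spectra from both major repositories and previously overlooked resources, then harmonizes metadata, standardizes chemical structures, and removes low-quality or redundant spectra.
- Curation uncovered previously undocumented pitfalls in public libraries that may have confounded prior comparisons of machine-learning models or caused valid spectra to be discarded from training sets.
- Spectraverse offers the most comprehensive coverage of chemical space of any machine-learning-ready MS/MS library to date, and expands coverage of the adducts and ionization modes encountered in metabolomics experiments.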

## Abstract

Reference libraries of tandem mass spectra (MS/MS) are widely used for metabolite identification in untargeted metabolomics and to train machine-learning models for metabolite annotation. However, public spectral libraries are scattered across disparate databases and contain spectra that are of low resolution or quality, lack critical metadata, or carry chemically incoherent annotations. Addressing these issues requires extensive preprocessing and considerable expertise in mass spectrometry, which presents a significant barrier to investigators interested in developing their own machine-learning models. Here, we present Spectraverse, a comprehensive and extensively curated library of public MS/MS spectra from small molecules. We assembled reference spectra from both major repositories and previously overlooked resources and then developed a preprocessing pipeline to harmonize metadata, standardize chemical structures, and remove low-quality or redundant spectra. These efforts led us to identify previously undocumented pitfalls in existing public libraries that may have confounded prior comparisons of machine-learning models or, conversely, caused valid MS/MS spectra to be discarded from the training sets of these models. The resulting resource affords the most comprehensive coverage of chemical space of any machine-learning-ready library of MS/MS spectra to date while also expanding the coverage of adducts and ionization modes encountered in metabolomics experiments. We intend to maintain and expand Spectraverse to encompass the growing number of publicly available reference MS/MS spectra expected to accumulate in the future.
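
The three preprocessing stages named in the abstract (harmonize metadata, standardize structures, remove low-quality or redundant spectra) can be illustrated with a minimal Python sketch. This is not the Spectraverse pipeline: the `Spectrum` class, the `ADDUCT_ALIASES` table, the `min_peaks` threshold, and the placeholder structure keys are all assumptions made for illustration, and a real pipeline would standardize structures with a cheminformatics toolkit rather than trust a precomputed key.

```python
from dataclasses import dataclass

# Hypothetical alias table: maps adduct spellings seen in different
# libraries onto one canonical form. A real table would be far larger.
ADDUCT_ALIASES = {
    "[M+H]+": "[M+H]+",
    "[M+H]": "[M+H]+",
    "M+H": "[M+H]+",
    "[M+Na]+": "[M+Na]+",
    "M+Na": "[M+Na]+",
    "[M-H]-": "[M-H]-",
    "M-H": "[M-H]-",
}


@dataclass(frozen=True)
class Spectrum:
    structure: str  # assumed to be an already-standardized structure key
    adduct: str     # adduct string as found in the source library
    peaks: tuple    # tuple of (m/z, intensity) pairs


def harmonize_adduct(raw):
    """Return the canonical adduct string, or None if it is unrecognized."""
    return ADDUCT_ALIASES.get(raw.strip().replace(" ", ""))


def curate(spectra, min_peaks=3):
    """Harmonize adducts, drop sparse spectra, and remove exact duplicates."""
    seen = set()
    kept = []
    for s in spectra:
        adduct = harmonize_adduct(s.adduct)
        if adduct is None:
            continue  # metadata could not be harmonized
        # Discard zero-intensity peaks and sort by m/z for a stable comparison.
        peaks = tuple(sorted((round(mz, 4), inten) for mz, inten in s.peaks if inten > 0))
        if len(peaks) < min_peaks:
            continue  # too few peaks to be informative (assumed threshold)
        key = (s.structure, adduct, peaks)
        if key in seen:
            continue  # redundant copy of a spectrum already kept
        seen.add(key)
        kept.append(Spectrum(s.structure, adduct, peaks))
    return kept


three_peaks = ((100.0, 10.0), (150.0, 5.0), (200.0, 1.0))
raw = [
    Spectrum("KEY-A", "M+H", three_peaks),
    Spectrum("KEY-A", "[M+H]+", three_peaks),     # same spectrum, different adduct spelling
    Spectrum("KEY-B", "[M-H]-", ((50.0, 1.0),)),  # too few peaks
    Spectrum("KEY-C", "??", three_peaks),         # unrecognized adduct
]
curated = curate(raw)
print(len(curated), curated[0].adduct)
```

Note that the two `KEY-A` records only collapse into one because the adduct is harmonized before deduplication. Silently dropping spectra with unrecognized adducts, as this sketch does, is the kind of choice that can discard valid spectra, so a production pipeline would more likely log or attempt to repair such records.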

## Full-text entities

- **Chemicals:** salts (MESH:D012492), polyethylene (MESH:D020959), Cl (MESH:D002713), Na (MESH:D012964), K (MESH:D011188), sulfoxides (MESH:D013454), CH3COOH - H (-), H (MESH:D006859)
- **Species:** Caenorhabditis elegans (species) [taxon 6239], Homo sapiens (human, species) [taxon 9606], Solanum lycopersicum (tomato, species) [taxon 4081]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12903054/full.md

---
Source: https://tomesphere.com/paper/PMC12903054