MS-BioGraphs: Sequence Similarity Graph Datasets

Mohsen Koohi Esfahani; Paolo Boldi; Hans Vandierendonck; Peter Kilpatrick; Sebastiano Vigna

arXiv:2308.16744 · cs.DC · September 1, 2023 · 2 cites

Open Access · 1 Repo

TL;DR

This paper introduces MS-BioGraphs, a new set of large, real-world sequence similarity graph datasets created from billions of protein sequences, addressing HPC challenges in data generation and providing valuable resources for graph processing research.

Contribution

The paper presents a novel process for generating extremely large sequence similarity graphs and introduces the MS-BioGraphs dataset family, significantly larger than existing datasets.

Findings

1. Created graphs with up to 2.5 trillion edges from 1.7 billion protein sequences

2. Optimized data structures and algorithms for large-scale graph generation

3. Demonstrated effective parallel compression of large graph datasets

Abstract

Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly-accessible, relevant, and realistic data sets. To ensure continuation of this progress, we (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process in creating MS-BioGraphs, a new family of publicly available real-world edge-weighted graph datasets with up to $2.5$ trillion edges, that is, $6.6$ times greater than the largest graph published recently. The largest graph is created by matching (i.e., all-to-all similarity aligning) $1.7$ billion protein sequences. The MS-BioGraphs family also includes seven subgraphs with different sizes and direction types. We describe two main challenges we faced in generating large graph datasets and our…
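To give a feel for the scale quoted in the abstract, the following is a minimal back-of-envelope sketch, not anything taken from the paper. It uses only the rounded figures stated above (1.7 billion sequences, 2.5 trillion edges, a factor of 6.6), reads "6.6 times greater" as a plain ratio, and counts unordered pairs for the all-to-all comparison. The function and key names are hypothetical, chosen for illustration.

```python
# Rounded figures quoted in the abstract (not exact dataset counts).
NUM_SEQUENCES = 1.7e9   # protein sequences, i.e. vertices of the largest graph
NUM_EDGES = 2.5e12      # edges in the largest graph
SIZE_RATIO = 6.6        # assumed to be a plain ratio to the previous largest graph


def scale_estimates(num_vertices, num_edges, size_ratio):
    """Return rough scale figures derived from the rounded inputs."""
    # Unordered pairs a naive all-to-all comparison would have to consider.
    candidate_pairs = num_vertices * (num_vertices - 1) / 2
    return {
        "avg_edges_per_vertex": num_edges / num_vertices,
        "candidate_pairs": candidate_pairs,
        "implied_previous_largest_edges": num_edges / size_ratio,
    }


estimates = scale_estimates(NUM_SEQUENCES, NUM_EDGES, SIZE_RATIO)
for name, value in estimates.items():
    print(f"{name}: {value:.3e}")
```

Under these assumptions the largest graph averages roughly 1,470 edges per sequence, a naive all-to-all comparison spans on the order of 1.4e18 pairs, and the previously largest graph would have had about 3.8e11 edges. Because the inputs are rounded, these are order-of-magnitude estimates only.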

Code & Models

Repositories

DIPSA-QUB/MS-BioGraphs-Validation (Official)

Taxonomy

Topics: Genomics and Phylogenetic Studies · Algorithms and Data Compression · Bioinformatics and Genomic Networks