Scalable Genomic Context Analysis with GCsnap2 on HPC Clusters
Reto Krummenacher, Osman Seckin Simsek, Michèle Leemann, Leila T. Alexander, Torsten Schwede, Florina M. Ciorba, Joana Pereira

TL;DR
GCsnap2 Cluster is a high-performance, scalable tool for genomic context analysis that leverages distributed computing to handle large datasets efficiently in HPC environments.
Contribution
The paper introduces GCsnap2 Cluster, a scalable, modular tool for genomic context analysis on HPC clusters that cuts execution time 22-fold relative to its predecessor, GCsnap1 Desktop.
Findings
22x reduction in execution time
Can analyze hundreds of thousands of sequences
Flexible deployment in various environments
Abstract
GCsnap2 Cluster is a scalable, high-performance tool for genomic context analysis, developed to overcome the limitations of its predecessor, GCsnap1 Desktop. Leveraging distributed computing with mpi4py.futures, GCsnap2 Cluster achieves a 22x improvement in execution time and can perform genomic context analysis for hundreds of thousands of input sequences on HPC clusters. Its modular architecture enables the creation of task-specific workflows and flexible deployment in various computational environments, making it well suited for bioinformatics studies of large-scale datasets. This work highlights the potential for applying similar approaches to solve scalability challenges in other scientific domains that rely on large-scale data analysis pipelines.
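To make the distributed-computing idea in the abstract concrete, here is a minimal sketch of batch-wise work distribution through the standard `concurrent.futures` executor interface, which `mpi4py.futures.MPIPoolExecutor` also implements. It is not GCsnap2 Cluster's actual code: the names `chunked`, `process_batch`, and `run_batches`, the batch size, and the placeholder per-batch work are all assumptions made for illustration.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split the inputs into consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def process_batch(batch: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for per-batch analysis: returns each input's length."""
    return [len(entry) for entry in batch]


def run_batches(executor: Executor, items: Sequence[str], batch_size: int) -> List[int]:
    """Send batches to the executor's workers and gather results in input order."""
    results: List[int] = []
    # Executor.map yields results in submission order, so the flattened
    # output lines up with the original input order.
    for partial in executor.map(process_batch, chunked(items, batch_size)):
        results.extend(partial)
    return results


if __name__ == "__main__":
    # A local thread pool keeps this demo runnable on a laptop. On a cluster,
    # an MPIPoolExecutor from mpi4py.futures could be passed in instead,
    # since it follows the same Executor interface.
    with ThreadPoolExecutor(max_workers=2) as executor:
        print(run_batches(executor, ["MKV", "MALWT", "MS"], batch_size=2))
```

Because `run_batches` only depends on the executor interface, the same function can be exercised locally and then handed an MPI-backed pool; with mpi4py installed, such a script is typically launched as `mpiexec -n 4 python -m mpi4py.futures script.py`.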
