Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows
Daniel Mas Montserrat, Ray Verma, Míriam Barrabés, Francisco M. de la Vega, Carlos D. Bustamante, Alexander G. Ioannidis

TL;DR
This paper introduces adaptive, RAM-efficient parallelization techniques for large-scale genomic workflows, improving resource utilization and reducing memory errors in chromosome-level bioinformatics processing.
Contribution
It presents novel symbolic regression and scheduling methods for dynamic, memory-aware parallelization of genomic workflows, enhancing efficiency over static approaches.
Findings
Reduced memory overruns in genomic workflows
Faster execution times in real-world pipelines
Effective load balancing across threads
Abstract
Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs…
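The task-packing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' scheduler: it assumes per-chromosome RAM predictions are already available as positive whole gigabytes (the values below are made up for illustration), and it uses a textbook 0/1 knapsack in which each task's value equals its predicted RAM, so each batch fills the memory budget as fully as possible. The names `pack_batch` and `schedule` are hypothetical.

```python
from typing import Dict, List


def pack_batch(pred_ram_gb: Dict[str, int], budget_gb: int) -> List[str]:
    """Choose the subset of tasks whose predicted RAM best fills the budget.

    Classic 0/1 knapsack with value == weight. Predictions are assumed to be
    positive integers (GB); this is an illustrative simplification.
    """
    # best[c] = (total RAM used, chosen task names) for capacity c
    best = [(0, ()) for _ in range(budget_gb + 1)]
    for name, ram in pred_ram_gb.items():
        # Iterate capacities downward so each task is used at most once.
        for c in range(budget_gb, ram - 1, -1):
            candidate = best[c - ram][0] + ram
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - ram][1] + (name,))
    return list(best[budget_gb][1])


def schedule(pred_ram_gb: Dict[str, int], budget_gb: int) -> List[List[str]]:
    """Greedily emit knapsack-packed batches until every task is scheduled."""
    remaining = dict(pred_ram_gb)
    batches: List[List[str]] = []
    while remaining:
        batch = pack_batch(remaining, budget_gb)
        if not batch:
            # No remaining task fits: its prediction exceeds the whole budget.
            raise ValueError("a task's predicted RAM exceeds the budget")
        batches.append(batch)
        for name in batch:
            del remaining[name]
    return batches


# Hypothetical predictions in GB, not taken from the paper.
predicted = {"chr1": 12, "chr2": 11, "chr7": 8, "chr21": 3, "chr22": 3}
batches = schedule(predicted, budget_gb=25)
```

With these made-up numbers, the first batch groups chr2, chr7, chr21, and chr22 to use the full 25 GB, and chr1 runs in a second batch. A dynamic scheduler as described in the abstract would additionally re-estimate RAM usage as tasks complete; that step is omitted here.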
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Genomics and Phylogenetic Studies · Scientific Computing and Data Management
