Accelerating Cloud-Based Transcriptomics: Performance Analysis and Optimization of the STAR Aligner Workflow
Piotr Kica, Sabina Lichołai, Michał Orzechowski, Maciej Malawski

TL;DR
This paper presents a scalable cloud-native pipeline for RNA-seq data alignment using STAR, with optimizations that significantly reduce execution time and costs, validated through extensive experiments.
Contribution
It introduces a novel cloud architecture for transcriptomics data processing and implements multiple optimizations, including early stopping, to enhance performance and cost-efficiency.
Findings
Early stopping reduces total alignment time by 23%
One of the most suitable EC2 instance types identified for cost-effective cloud deployment
Spot instances are viable for large-scale transcriptomics workflows
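To make the instance-type finding concrete, the comparison can be thought of as cost per sample: hourly price multiplied by alignment runtime. The sketch below is only an illustration of that arithmetic. The instance names, prices, and runtimes are made-up placeholders, not figures from the paper, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def cost_per_sample(hourly_price_usd: float, runtime_hours: float) -> float:
    """Cost of aligning one sample: price per hour times hours used."""
    return hourly_price_usd * runtime_hours


def cheapest_instance(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Return the candidate name with the lowest cost per sample.

    `candidates` maps a name to (hourly_price_usd, runtime_hours).
    """
    return min(candidates, key=lambda name: cost_per_sample(*candidates[name]))


# Placeholder values for illustration only (NOT measurements from the paper).
example_candidates = {
    "instance_a": (1.00, 0.50),  # pricier per hour, moderate runtime
    "instance_b": (0.60, 0.75),  # cheaper per hour, but slower
    "instance_c": (2.00, 0.30),  # fastest, but most expensive per hour
}
best = cheapest_instance(example_candidates)
```

With these placeholder inputs the slower but cheaper candidate wins, which shows why the fastest instance is not automatically the most cost-effective one. The same calculation applies to spot pricing by substituting the spot hourly price.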
Abstract
In this work, we explore the Transcriptomics Atlas pipeline adapted for cost-efficient and high-throughput computing in the cloud. We propose a scalable, cloud-native architecture designed for running a resource-intensive aligner, STAR, and processing tens or hundreds of terabytes of RNA-sequencing data. We implement multiple optimization techniques that significantly reduce execution time and cost. The impact of individual optimizations is measured in medium-scale experiments, followed by a large-scale experiment that combines all of them and validates the current design. The early stopping optimization reduces total alignment time by 23%. We analyze the scalability and efficiency of STAR, one of the most widely used sequence aligners. For the cloud environment, we identify one of the most suitable EC2 instance types and verify the applicability of spot instances.
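The abstract does not spell out the early-stopping criterion. One plausible reading is that an alignment is aborted once partial progress shows it is unlikely to produce useful output, for example because too few reads are mapping. The sketch below illustrates that kind of decision rule under this assumption. The `Progress` type, the function name, and both thresholds are hypothetical placeholders, not the authors' implementation or values.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    reads_processed: int  # reads the aligner has consumed so far
    reads_mapped: int     # of those, reads that mapped
    reads_total: int      # total reads in the input sample


def should_stop_early(p: Progress,
                      min_fraction_seen: float = 0.10,   # assumed placeholder
                      min_mapping_rate: float = 0.30) -> bool:  # assumed placeholder
    """Decide whether to abort an alignment based on partial progress.

    Returns True only when enough of the input has been processed to
    judge it AND the mapping rate so far is below the threshold.
    """
    if p.reads_total <= 0 or p.reads_processed <= 0:
        return False  # nothing to judge yet
    fraction_seen = p.reads_processed / p.reads_total
    if fraction_seen < min_fraction_seen:
        return False  # too early to decide
    mapping_rate = p.reads_mapped / p.reads_processed
    return mapping_rate < min_mapping_rate
```

In a pipeline, a check like this would run periodically against the aligner's progress report, and a True result would terminate the job so the instance can move on to the next sample. Time saved then depends on how many samples fail the check and how early they are caught.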
Taxonomy
Topics: Gene expression and cancer classification · Bioinformatics and Genomic Networks
