rMATS-cloud: Large-scale Alternative Splicing Analysis in the Cloud
Jenea I Adams, Eric Kutschera, Qiang Hu, Chun-Jie Liu, Qian Liu, Kathryn Kadash-Edmondson, Song Liu, Yi Xing

TL;DR
This paper introduces rMATS-cloud, a scalable cloud-based tool for analyzing alternative splicing in RNA-seq data, suitable for large-scale biomedical research.
Contribution
The novel contribution is the development of a portable and scalable cloud version of the rMATS workflow for alternative splicing analysis.
Findings
rMATS-cloud efficiently processes RNA-seq datasets with thousands of samples.
The tool is compatible with multiple cloud platforms like Cavatica, Terra, and Seqera.
It is well suited to the large RNA-seq datasets already housed in cloud data repositories.
Abstract
Although gene expression analysis pipelines are often a standard part of bioinformatics analysis, with many publicly available cloud workflows, cloud-based alternative splicing analysis tools remain limited. Our lab released rMATS in 2014 and has continuously maintained it, providing a fast and versatile solution for quantifying alternative splicing from RNA sequencing (RNA-seq) data. Here, we present rMATS-cloud, a portable version of the rMATS workflow that can be run in virtually any cloud environment suited for biomedical research. We compared the time and cost of running rMATS-cloud with two RNA-seq datasets on three different platforms (Cavatica, Terra, and Seqera). Our findings demonstrate that rMATS-cloud handles RNA-seq datasets with thousands of samples, and therefore is ideally suited for the storage capacities of many cloud data repositories. rMATS-cloud is available at…
Table 1
| Feature | CWL | WDL | Nextflow |
|---|---|---|---|
| Cloud platform | BioData Catalyst, Cancer Genomics Cloud, Cavatica | AnVIL, BioData Catalyst, DNAnexus, Terra | Seqera |
| URL | | | |
Table 2
| Platform (workflow language) | EMT Dataset: average time | EMT Dataset: average cost | COG Dataset: time | COG Dataset: cost |
|---|---|---|---|---|
| Cavatica (CWL) | 49.7 min | $0.25 (USD) | 5 h 41 min | $38.70 (USD) |
| Terra (WDL) | 42.3 min | $0.17 (USD) | | |
| Seqera (Nextflow) | 15.3 min | $0.05 (USD) | | |

EMT Dataset: 6 cell line samples; 249.2 million reads per BAM on average. COG Dataset: 1113 patient samples; 173.7 million reads per BAM on average.
Funding: National Institutes of Health (funder ID: 10.13039/100000002)
Introduction
Today’s multi-omic landscape is producing ever-growing volumes of high-dimensional data with expansive storage and analysis needs. In response, the biomedical research community is increasingly turning to cloud storage rather than housing data solely on private high-performance computing (HPC) systems. With the right tools, researchers can now run end-to-end analysis workflows in environments that also act as highly secure data repositories. Cloud storage sites may also house large-scale biomedical datasets alongside built-in analysis suites. Ultimately, cloud computing is changing how researchers collaborate, work, and share reproducible research in ways that save money and time by eliminating the need to download and store massive datasets.
Cloud computing enables the storage and analysis of large-scale, multi-omic (genomic, transcriptomic, proteomic, metabolomic, etc.) data. With the prevalence of high-throughput sequencing technologies, RNA sequencing (RNA-seq) data have become a highly abundant data type for studying gene expression and RNA processing. One of the key steps in RNA processing is RNA splicing, which is the removal of intronic regions and the joining of exonic regions of precursor mRNA [1]. Moreover, alternative splicing can generate multiple transcript and protein isoforms from individual gene loci, greatly enhancing the regulatory and phenotypic diversity of eukaryotic organisms [2]. Using RNA-seq, researchers can quantify splicing events by counting the number of reads that map to individual splice junctions.
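To make the read-counting idea concrete, the sketch below computes a length-normalized inclusion level (often called percent spliced in, PSI) for a single exon-skipping event. This is a minimal illustration of one common formulation, not a reproduction of the rMATS-turbo code; the function name, the toy read counts, and the effective lengths are all assumptions chosen for the example.

```python
def inclusion_level(inc_reads, skip_reads, inc_eff_len, skip_eff_len):
    """Length-normalized inclusion level (PSI) for one splicing event.

    inc_reads / skip_reads: reads supporting the inclusion / skipping isoform.
    inc_eff_len / skip_eff_len: effective lengths used to normalize the counts
    (hypothetical values here; real tools derive them from read and exon lengths).
    """
    if inc_eff_len <= 0 or skip_eff_len <= 0:
        raise ValueError("effective lengths must be positive")
    inc_density = inc_reads / inc_eff_len
    skip_density = skip_reads / skip_eff_len
    total = inc_density + skip_density
    if total == 0:
        return None  # no junction reads support either isoform
    return inc_density / total


# Toy example: 80 reads over the two inclusion junctions, 20 over the skipping junction.
psi = inclusion_level(inc_reads=80, skip_reads=20, inc_eff_len=2, skip_eff_len=1)
print(round(psi, 3))  # 0.667
```

Normalizing by effective length matters because the inclusion isoform offers more junction positions for reads to land on than the skipping isoform, so raw counts alone would overstate inclusion.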
Although gene expression analysis pipelines are often a standard part of bioinformatics analysis, with many publicly available cloud workflows, cloud-based alternative splicing analysis tools remain limited. rMATS-turbo [3], which has been maintained by our lab since 2017 after its predecessor rMATS was published in 2014 [4], discovers and quantifies alternative splicing events from large-scale RNA-seq data. We recently developed rMATS-cloud, a portable version of the rMATS-turbo workflow that can be run in virtually any cloud environment suited for biomedical research. rMATS-cloud handles RNA-seq datasets with thousands of samples, making it well suited to the many cloud data repositories that already house large RNA-seq datasets.
Implementation
rMATS-cloud currently supports three central workflow languages: Workflow Description Language (WDL) [5], Common Workflow Language (CWL) [6], and Nextflow [7] (Table 1). This flexibility allows users to integrate rMATS-cloud into existing workflows on various platforms, including Terra [5], CAVATICA (https://cavatica.sbgenomics.com/), Cancer Genomics Cloud [8], and more. These workflows are available on Dockstore [9] and GitHub (https://github.com/Xinglab/rmats-turbo/).
Cloud environments that support these workflow languages include Cancer Genomics Cloud and CAVATICA for CWL; AnVIL, BioData Catalyst, DNAnexus, and Terra for WDL; and Seqera for Nextflow. These platforms share key features such as scalability, reproducibility, containerized execution, and compatibility with HPC systems. The primary distinction between them lies in the workflow language each supports: CWL, WDL, or Nextflow. Among these, WDL is known for its relatively readable syntax, whereas CWL can be more complex to read. Nonetheless, the high portability of these workflow languages enables them to meet a wide array of computational needs.
As shown in Figure 1, rMATS-cloud starts with BAM files for each relevant RNA-seq sample or sample group, and a GTF file for gene and transcript annotations. Users can modify any of the input parameters through the workflow config files, including read length, novel splice site detection (on/off), and choice of statistical method. In the rMATS-cloud workflow, the rMATS-turbo prep step is run on each BAM file to compute splicing graphs for each gene. The prep step can be parallelized across machines in the cloud platform to utilize available resources. Once the prep steps are complete, the post step combines the splicing graphs from all of the prep steps and quantifies alternative splicing events by type. While downstream analysis can be done in the cloud, the output files are relatively small compared to the input BAM files, making it convenient to download them for local analysis and visualization if preferred.
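The prep/post structure described above is a scatter-gather pattern: each sample is summarized independently, then a single step merges the summaries. The sketch below illustrates that pattern only. The `prep` and `post` functions, the `TOY_SAMPLES` data, and the junction identifiers are hypothetical stand-ins and do not call rMATS-turbo or read real BAM files; in the actual workflows the scatter is handled by the CWL, WDL, or Nextflow engine across cloud machines rather than by threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-sample BAM content: one junction ID per spliced read.
TOY_SAMPLES = {
    "sample_a.bam": ["j1", "j1", "j2"],
    "sample_b.bam": ["j1", "j3"],
    "sample_c.bam": ["j2", "j2", "j2"],
}


def prep(bam_name):
    """Scatter step: summarize one sample without looking at any other sample."""
    return bam_name, Counter(TOY_SAMPLES[bam_name])


def post(prep_results):
    """Gather step: merge all per-sample summaries into one junction-by-sample table."""
    junctions = sorted({j for _, counts in prep_results for j in counts})
    return {
        j: {name: counts.get(j, 0) for name, counts in prep_results}
        for j in junctions
    }


def run_workflow(bam_names, max_workers=3):
    # The prep calls are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        prep_results = list(pool.map(prep, bam_names))
    # The post step starts only after every prep result is available.
    return post(prep_results)


table = run_workflow(sorted(TOY_SAMPLES))
print(table["j1"])  # {'sample_a.bam': 2, 'sample_b.bam': 1, 'sample_c.bam': 0}
```

Because the per-sample step has no cross-sample dependencies, adding samples increases the number of parallel tasks rather than the length of any single task, which is what lets this design scale to thousands of BAM files.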
Figure 1. Schematic of rMATS-cloud workflow. rMATS-cloud currently supports three main workflow management systems: WDL, CWL, and Nextflow. The rMATS-cloud workflow takes BAM files for RNA-seq data and a GTF file for gene and transcript annotations, in the form of cloud platform paths or a compiled data file. Next, the rMATS-turbo workflow runs end to end with its prep step and post step, yielding output files that can be downloaded to a local file system for downstream analysis. A mock dataset was used to generate the heatmap shown. WDL, Workflow Description Language; CWL, Common Workflow Language; RNA-seq, RNA sequencing.
Results
We ran rMATS-cloud on (1) a small-scale RNA-seq dataset from two prostate cancer cell lines with contrasting epithelial versus mesenchymal phenotypes, each with three biological replicates [10] (EMT Dataset; SRA: PRJNA438990) and (2) a large-scale RNA-seq dataset of pediatric acute myeloid leukemia (AML) patient samples from the Children’s Oncology Group (COG Dataset, https://cavatica.sbgenomics.com/u/kids-first-drc/sd-pet7q6f2; AAML1031 clinical trial, NCT01371981-D4). We compared the time and cost of running rMATS-cloud on three different platforms (Cavatica, Terra, and Seqera). The results are shown in Table 2. For the EMT Dataset, we performed three runs of the same workflow on each platform and averaged the cost and time of the runs. The six BAM files of the EMT Dataset were processed in 15.3–49.7 min at a cost of USD $0.05–$0.25. Because the COG Dataset is securely housed within the Cavatica platform, we tested it on that platform only, with a single run, to limit costs. In Cavatica, rMATS-cloud processed 1113 BAM files in 5 h and 41 min for USD $38.70.
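As a rough aid for comparing runs of different sizes, the figures reported in Table 2 can be normalized per BAM file. The sketch below does only that arithmetic; the `RUNS` dictionary and `per_bam` helper are illustrative names, and the per-BAM time is wall-clock time divided by sample count, not the compute time spent on any one sample, since the prep step runs in parallel. The two datasets also differ in average reads per BAM, so the values are not a controlled comparison.

```python
# Values copied from Table 2; minutes are wall-clock time for the whole run.
RUNS = {
    "Cavatica (CWL), EMT": {"bams": 6, "minutes": 49.7, "usd": 0.25},
    "Terra (WDL), EMT": {"bams": 6, "minutes": 42.3, "usd": 0.17},
    "Seqera (Nextflow), EMT": {"bams": 6, "minutes": 15.3, "usd": 0.05},
    "Cavatica (CWL), COG": {"bams": 1113, "minutes": 5 * 60 + 41, "usd": 38.70},
}


def per_bam(run):
    """Return cost and wall-clock seconds divided by the number of BAM files."""
    return {
        "usd_per_bam": run["usd"] / run["bams"],
        "seconds_per_bam": run["minutes"] * 60 / run["bams"],
    }


for name, run in RUNS.items():
    stats = per_bam(run)
    print(f"{name}: ${stats['usd_per_bam']:.3f} per BAM, "
          f"{stats['seconds_per_bam']:.1f} s per BAM")
```

For the COG run this works out to roughly $0.035 and about 18 s of wall-clock time per BAM file.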
Discussion
Here, we present a solution for processing large-scale RNA-seq data for alternative splicing analysis in the cloud. rMATS-cloud is now available in three different workflow languages (CWL, WDL, and Nextflow) and can be run on diverse cloud platforms, including widely used platforms crucial to this era of collaborative genomics research. We show that rMATS-cloud processes a large pediatric AML dataset of 1113 BAM files in less than 6 h for less than $40. While the HPC version of rMATS-turbo is itself fast and versatile software, rMATS-cloud provides the same robust performance for datasets that are stored in the cloud, without requiring the data to be downloaded. By allowing data to be analyzed where it is stored, rMATS-cloud saves space and time. This broadens accessibility to a global audience, which is increasingly leveraging the cloud to analyze genomics data in a highly collaborative environment.
While the rMATS-turbo software does not directly operate on single-cell or spatial RNA-seq data, it has been adapted for these applications [11,12]. Therefore, with additional customization, we envision that rMATS-cloud can be adapted for cloud-based alternative splicing analysis of single-cell and spatial RNA-seq datasets.
Additional considerations when working with biomedical data in a cloud environment include managing privacy and security. For researchers looking to house large-scale data, many platforms such as the Cancer Genomics Cloud [8] allow users to perform genomic analysis in workspaces that can be kept private or shared with collaborators. Although responsibility for proper data stewardship and governance rests with users, many platforms provide consistent frameworks for securing data where it is stored.
rMATS-cloud expands the repertoire of available cloud analysis tools for RNA-seq data, enhancing data sharing and reuse by better supporting bioinformatics analysis on collaborative platforms. Each cloud platform has a graphical user interface that allows users to customize workflow configuration settings and manage their large-scale data without extensive computational expertise. rMATS-cloud workflows are available to download from Dockstore and GitHub (https://github.com/Xinglab/rmats-turbo/).
References
1. Berk AJ. Discovery of RNA splicing and genes in pieces. Proc Natl Acad Sci U S A 2016;113:801–5. PMID: 26787897. doi: 10.1073/pnas.1525084113. PMC4743779.
2. Wright CJ, Smith CWJ, Jiggins CD. Alternative splicing as a source of phenotypic diversity. Nat Rev Genet 2022;23:697–710. PMID: 35821097. doi: 10.1038/s41576-022-00514-4.
3. Wang Y, Xie Z, Kutschera E, Adams JI, Kadash-Edmondson KE, Xing Y. rMATS-turbo: an efficient and flexible computational tool for alternative splicing analysis of large-scale RNA-seq data. Nat Protoc 2024;19:1083–104. PMID: 38396040. doi: 10.1038/s41596-023-00944-2.
4. Shen S, Park JW, Lu ZX, Lin L, Henry MD, Wu YN, et al. rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. Proc Natl Acad Sci U S A 2014;111:E5593–601. PMID: 25480548. doi: 10.1073/pnas.1419161111. PMC4280593.
5. Voss K, Gentry J, Van der Auwera G. Full-stack genomics pipelining with GATK4 + WDL + Cromwell. F1000Res 2017;6:1379.
6. Crusoe MR, Abeln S, Iosup A, Amstutz P, Chilton J, Tijanić N, et al. Methods included: standardizing computational reuse and portability with the Common Workflow Language. Commun ACM 2022;65:54–63.
7. Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol 2017;35:316–9. PMID: 28398311. doi: 10.1038/nbt.3820.
8. Lau JW, Lehnert E, Sethi A, Malhotra R, Kaushik G, Onder Z, et al. The Cancer Genomics Cloud: collaborative, reproducible, and democratized—a new paradigm in large-scale computational research. Cancer Res 2017;77:e3–6. PMID: 29092927. doi: 10.1158/0008-5472.CAN-17-0387. PMC5832960.
