Metapipeline-DNA: A comprehensive germline and somatic genomics Nextflow pipeline
Yash Patel, Chenghao Zhu, Takafumi N. Yamaguchi, Nicholas K. Wang, Nicholas Wiltsie, Nicole Zeltser, Alfredo E. Gonzalez, Helena K. Winata, Yu Pan, Mohammed Faizal Eeman Mootor, Timothy Sanders, Sorel T. Fitz-Gibbon, Cyriac Kandoth, Julie Livingstone, Lydia Y. Liu

TL;DR
Metapipeline-DNA is an automated, flexible pipeline for analyzing DNA sequencing data to extract genetic and evolutionary features from both germline and somatic genomes.
Contribution
The paper introduces Metapipeline-DNA, a comprehensive and extensible Nextflow pipeline for scalable genomic analysis with robust automation and cloud compatibility.
Findings
Metapipeline-DNA supports analysis of both germline and somatic DNA sequencing data.
The pipeline is optimized for scalability, reproducibility, and consistent quality control.
It integrates multiple algorithms for detecting nuclear, mitochondrial, and evolutionary features.
Abstract
The price, quality, and throughput of DNA sequencing continue to improve. Algorithmic innovations have allowed inference of a growing range of features from DNA sequencing data, quantifying nuclear, mitochondrial, and evolutionary aspects of both germline and somatic genomes. To automate analyses of the full range of genomic characteristics, we created an extensible Nextflow metapipeline called metapipeline-DNA. It analyzes targeted and whole-genome sequencing data from raw reads through preprocessing, feature detection by multiple algorithms, quality control, and data-visualization. Each step can be run independently and is supported by robust software engineering including automated failure-recovery, granular testing, and consistent verifications of inputs, outputs, and parameters. Metapipeline-DNA is cloud-compatible and highly configurable, with options to subset and optimize each…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCRISPR and Genetic Engineering · Molecular Biology Techniques and Applications · Chromosomal and Genetic Variations
Introduction
High-throughput technologies have made biomedical research increasingly data-intensive. DNA sequencing is a key enabling technology, used both in routine clinical care and to support a wide range of research studies.1 Ongoing improvements in DNA sequencing continue to reduce costs and enable new discoveries, like elucidation of complex structural variants (SVs) and repetitive genomic regions by long-read sequencing.2 Modern germline DNA sequencing studies routinely quantify single-nucleotide polymorphisms (SNPs), SVs, telomere length, mitochondrial copy number and variation, copy number, and many other features.3^,^4^,^5
DNA sequencing has been especially helpful in characterizing tumors. In many studies both a sample of a cancer tissue and a “reference” normal sample from the same individual are sequenced to better distinguish somatic from germline variation and enable analysis of germline-somatic interactions. Cancers often exhibit widespread genomic rearrangements and clonal variation in mutation burden, specific patterns of somatic mutations associated with carcinogens or other features and a host of features absent or uncommon in germline genomes, like kataegis and chromothripsis.6 Comprehensive analyses of cancer sequencing can improve diagnosis, prognosis, and management.7^,^8
The growing availability of DNA sequencing has been paralleled by rapid development and adoption of both specific algorithms and workflow software. New discoveries often rely heavily on complex workflows comprising a mixture of established and novel algorithms.9 These workflows, often termed “pipelines,” are implemented in a range of orchestration frameworks including Galaxy,10 Snakemake,11 Common Workflow Language (CWL),12 and Nextflow.13 Workflows provide a way to automate processes by minimizing manual handling of data flow and facilitating stitching together of different tools to process raw data into refined forms such as lists of variants or quantitation of specific features.
The use of complex workflows has placed a growing emphasis on standardization, extensibility, quality control, and compute infrastructure. Workflow implementations routinely differ across research groups, with many groups creating their own. Many workflows lack key features like unit testing, integration testing, error-handling, fault-tolerance, input-output verification, quality-control, data-visualization, and use of multiple algorithms to create consensus calls.14 Given the volume of data and the expense of compute, workflows are often bespoke to the high-performance computing environment used by a single group.15 Portability of workflows to new environments is part of the “model to data” (M2D) paradigm in data sharing and processing.16 M2D overcomes the cost, time, and privacy risks of data-transfer by bringing models or algorithms to the computing system where data are stored. M2D thus necessitates that models be portable across providers and environments to support workflow usage in conjunction with good data management principles hinging on findability, accessibility, interoperability, and reusability.17
To address the need for a robust open-source DNA sequencing analysis pipeline, we created metapipeline-DNA. This Nextflow metapipeline is highly customizable and is capable of processing data from any stage of analysis. It can process DNA sequencing data starting from raw reads through alignment and recalibration, variant calling and even highly integrated analyses likely tumor subclonal reconstruction. Extensive quality control, testing, and data-visualization are built into each individual step and into the full metapipeline. It can work on multiple compute systems and clouds, facilitating analyses at any scale.
Results
Overview
Metapipeline-DNA is a Nextflow metapipeline for the analysis of DNA sequencing data. It can analyze both targeted and whole-genome sequencing with 16 pipelines (Table 1) that collectively transform raw sequencing reads into sets of detected variants and other genetic and evolutionary features (Figure 1A). Most individual pipelines execute multiple algorithms and create consensus calls. For example, subclonal copy number aberration (CNA) detection uses two algorithms (FACETS and Battenberg) and produces visualizations including logR and B-allele frequency (BAF) plots (Figure 1B). Similarly four separate algorithms can be executed for somatic single-nucleotide variant (SNV) detection,14 automatically generating a consensus set of predictions and variant-associated data-visualizations (Figures 1C and 1D). Each pipeline can be executed independently and can be extensively parameterized to customize the selection and tuning of algorithms.Table 1. Metapipeline-DNA constituent pipelinesPipelineInput formatsOutput artifactsAlgorithmsFeaturesConvert-BAM2FASTQBAM/CRAMFASTQSAMtoolsautomatic detection of and conversion from CRAM to BAM before reversion to FASTQ in the event of CRAM input support for both BAM and CRAM inputsAlign-DNAFASTQBAMBWA-MEM2 HISAT2duplicate markingCalculate-targeted-coverageBAM target region BEDexpanded regions per-base depth in target regions and dbSNP siteshybrid-selection metricsSAMtools BEDtoolsautomatic expansion of regions to off-target dbSNP loci with coverageRecalibrate-BAMBAM target regionsINDEL realigned and base-quality score recalibrated BAMGATKsupport for target regions local INDEL realignmentbase-quality score recalibrationGenerate-SQC-BAMBAMBAM statisticscoverage metricsSAMtools PicardQualimapcustomizable selection of QCcoverage reporting and visualizationCall-gSNPBAM target regionsper-sample GVCF germline SNP VCFGATK DeepVariantvariant quality score recalibrationambiguous variant filtrationCall-mtSNVBAM/CRAMmitochondrial SNV VCFMToolBox mitoCallermitochondrial read extraction support for BAM and CRAMheteroplasmy callingCall-gSVBAMgermline SV VCF germline SV BCFDELLY Mantagermline CNV calling variant call QCCall-sSVBAMsomatic SV VCF somatic SV BCFDELLY Manta SVisiongermline SV filtrationCall-sSNVBAM somatic SNV calls****panel of normalsomatic SNV VCFsMutect2 Strelka2SomaticSniperMuSE DeepSomaticBCFtools-Intersectsupport for panel of normals tumor-only modemulti-tumor mode consensus callset and visualizationvariant allele frequency distribution by callsetCall-sCNABAMsomatic CNA VCF or TSVBattenberg FACETSstandardized visualizationOption for customizing Battenberg refit suggestionsCall-SRCSNV callsCNA callsSNV clusteringreconstructed phylogenyPyClone PyClone-VIPhyloWGS DPClustFastClone CliPCONIPHERCustomizable combinations of clustering algorithm and phylogeny algorithmstandardized clustering and phylogeny formatsStableLiftvariant calls (gSNP, sSNV, sSV, gSV)lifted variant calls in target reference genomeStableLiftcustomizable direction of liftovercustomizable selection of model for stability predictionCall-GeneticAncestrygermline variant callscalled genetic ancestryADMIXTUREPLINK2support for VCF and PLINK inputsAnnotate-VCFvariant callsAnnotated variant callsSnpEffFuncotatorVEPsupport for variant normalizationcustomizable selection of annotation databasesCalculate-mtDNA-CopyNumbersample coveragemitochondrial copy number calculated based on coveragesupport for extensible range of coverage sourcesPipelines encompassed within metapipeline-DNA and their inputs, outputs, algorithms, and key features. Inputs that are bolded are optional and inputs separated by “/” represent a list of choices from which one must be chosen.Figure 1. Data flow and visualizations(A) Data flow through metapipeline-DNA, highlighting the available analyses and modes (somatic, germline, etc.). The different available entry points into metapipeline-DNA are indicated to show flexibility of pipeline selection and input format.(B) Normalized tumor coverage relative to the matched normal (log_2_R) and the B-allele frequency of individual SNPs laid out across the genome to support CNA detection.(C) Variant calls from 4 SNV callers: MuSE2, SomaticSniper, Strelka2, and Mutect2. Intersection diagram shows consensus sets of variants between the callers.(D) Variant allele frequencies (VAFs) based on consensus status between callers. VAFs are indicated for all combinations of consensus between one-, two-, three-, and four-variant callers, with each data point representing one combination. The adjusted VAF is calculated as an average of all variants present in the combination.(E) Input sample combinations supported by metapipeline-DNA. The nT-mN (e.g., 2T-2N) combination indicates any arbitrary numbers of normal and tumor samples. Each combination is automatically detected based on the provided inputs and considered during processing of all pipelines to select appropriate algorithms and processing modalities.(F) Automatic and customizable interval usage in metapipeline-DNA. Original intervals undergo assessment and expansion at dbSNP sites with high coverage to produce expanded intervals that cover high-coverage off-target regions. The option of using the expanded or the original intervals for downstream BAM recalibration and variant calling is parameterized with both options automated.
Several different sample run-modes are available, which we denote with the terminology nT-mN, where n indicates the number of tumor samples and m the number of reference samples (Figure 1E). Thus, classic paired tumor-normal analysis is 1T-1N. Metapipeline-DNA fully supports modes like 0T-1N (i.e., germline DNA sequencing), 0T-3N (e.g., family trios), 1T-0N (i.e., unpaired tumour-only sequencing), and arbitrary multiregion tumor sequencing (e.g., 5T-1N). The primary limitation to multisample analyses is compute resource availability—particularly RAM and scratch-disk space. Metapipeline-DNA automatically handles input types for each mode and only executes feasible pipelines, independent of user-selections. For example, in 0T modes, variant detection is restricted to germline variants without users having to provide manual guidance or parameterization.
The default mode of metapipeline-DNA accepts unaligned reads in FASTQ18 format and executes all pipelines. A range of alternative entry-points are possible, including starting from an unaligned BAM, an aligned BAM,19 or from CRAM files, with automatic BAM-to-FASTQ conversions as needed. A few pipelines accept alternative entry-points, such as SNV and CNA calls for tumor subclonal reconstruction,20 germline SNPs for determination of genetic ancestry and germline SNPs or somatic SNVs for variant functional annotation (Figure 1A). Documentation of all dependencies, input and output formats is available on standardized structured GitHub pages: current states at writing are summarized in Table S1.
We engineered metapipeline-DNA to be intrinsically flexible. All dependencies are automatically identified and executed. All run-modes and settings defaults set to the most common behavior across thousands of runs, with easy re-parameterization. For example, when input data are already aligned the default is to use these alignments. Configuration parameters allow the user to control whether reads are converted to FASTQ and re-aligned and whether aligned reads are recalibrated and so forth.
In a similar way, metapipeline-DNA is flexible to the specific genome build used and has been tested extensively with GRCh37, GRCh38, and GRCm39. It can run in two modes: WGS mode and targeted-sequencing mode, based on user parameterization. Targeted-sequencing mode supports all subsets of the genome, including exomes and arbitrary panels. Options are available to assess coverage, evaluate covered off-target sites and to automatically use expanded target intervals for downstream processing (Figure 1F).
Data visualization and quality-control
Metapipeline-DNA includes a range of quality control steps and pipelines to assess data at each step, including raw reads, alignments, and variant calls. The pipeline for back-conversion from BAM/CRAM to FASTQ includes built-in checks like SAM flag and alignment statistics assessment and read count comparison before and after conversion to FASTQ (to ensure no loss of reads due to file corruption or parallelization scatter-gather failures, for example). These quality controls produce a variety of data-visualizations and reports. For example, alignment quality is inferred from BAM (or CRAM) files in a range of ways including coverage distributions over the genome (or target region with or without padding; Figures 2A and 2B). Reads are quantified by a range of quality metrics, including total counts, mapping qualities, guanine cytosine (GC) content, insert sizes, read lengths, duplications, and others. Figure 2C shows an example of read number stratified by a range of quality groupings. A range of software are used to generate these metrics, including SAMtools,19 Picard,21 and Qualimap.22 Pileup summaries at common sites are generated and used as a precursor to estimate contamination across samples. Visualization is also built into the SV calling pipelines to produce representations of SVs and their categorization (inversion, insertion, and breakend/translocation) in circos plots (Figure 2D).Figure 2. Alignment and coverage metrics(A) Percent of bases in the genome at each fold of coverage for normal and tumor samples (each line represents one sample) for all five WGS PCAWG patients. Each line represents a different sample, with the percentage of bases calculated using the coverage metrics.(B) Distribution of mean and median coverage across all samples, highlighting two rough separations arising from the normal samples and tumor samples with the normal samples being in the lower coverage separation.(C) Distributions of reads across alignment metrics including mapped/unmapped, low mapping quality, duplication, and paired.(D) Circos plot of somatic SVs categorized into inversions, insertions, and breakend/translocations.(E) TP, FN, and FP variant calls comparing raw SNP and INDEL calls from HaplotypeCaller- and XY-filtered variant calls against the GIAB HG002 truth set. Numbers represent the number of variant calls with reduction in false discovery rate with XY filtration highlighted in red.(F) Time and memory usage of pipelines per sample for the three different processing cohorts: PCAWG with GRCh38, PCAWG with GRCh37, and TCGA with GRCh38. Time is measured as wall-clock time for each pipeline and the total time taken by metapipeline-DNA. Memory is measured as the peak RAM usage in GB by any single process by any pipeline.(G) TP, FN, and FP variant calls comparing consensus call-SNV calls from the PCAWG-5 samples against a set of validation variant calls made from targeted deep-sequencing of the same samples. Numbers represent the number of variant calls.(H) Reconstructed phylogeny of tumor samples SA478344, SA528788, and SA528876 using a consensus SNV callset comprising variants called by at least two out of four SNV callers (MuSE2, SomaticSniper, Strelka2, and Mutect2) and FACETS CNAs. Nodes represent identified subclones with the evolutionary history depicted over SNV accumulation. Along the x axis is the cellular prevalence (CP), indicating the fraction of all cells comprising each subclone.
In targeted-sequencing mode, additional coverage assessment is performed through per-base read depth calculations at target regions and well-characterized off-target polymorphic sites provided from dbSNP.23 The workflow also generates an expanded set of targets encompassing the original target regions plus user-defined polymorphic sites (typically dbSNP) enriched in coverage over a user-defined threshold. Metapipeline-DNA provides with configuration to automatically use the expanded targets with BAM recalibration and variant calling pipelines.
Variants are additionally evaluated for stability across reference genomes: StableLift24 is available as an optional workflow to support liftover of sSVs, gSVs, sSNVs, and gSNPs between GRCh37 and GRCh38. In addition to liftover, the pipeline annotates variants with databases such as dbSNP23 and applies a model to assign a stability score to each variant to indicate the likelihood of the variant being consistently represented across the two reference genome builds. Variant type-specific assessment is also performed. For example, germline SNP calls undergo filtration using models built from variant quality scores for both SNPs and Indels. Somatic SNVs are assessed based on consensus between callers and associated variant allele frequencies. The consensus approach across callers allows for filtering of SNV calls to reduce the rate of false positives made by a single caller.
Germline SNP calls undergo genetic sex-specific evaluations to reduce the rate of false positives. Variants on chromosomes X and Y in non-pseudo-autosomal regions25^,^26^,^27 (PARs) are extracted and filtered based on the genetic sex. In XY samples, heterozygous genotype calls are removed, and homozygous genotype calls are converted to hemizygous. In XX samples, all chromosome Y variant calls are removed. This reduces the false positive rate of variant calls made on the sex chromosomes.
Software-engineering and pipeline robustness
We placed a heavy focus on generating reusable and extensible software that could automatically detect and recover from common errors, particularly in the compute environment. This led us to adopt or create a series of development practices and pipeline features aimed at maximizing quality. All software is open-source, available on GitHub (https://github.com/TheBoutrosLab/metapipeline-DNA, with older versions available at https://github.com/uclahs-cds/metapipeline-DNA), with transparent tracking of issues and discussions. Development followed a test-driven approach using NFTest.14 Metapipeline-DNA has a suite of 95 unit, integration, and regression tests that are run for each new release with testing performed for different stages of execution from end-to-end tests to individual pipeline tests. The tests utilize publicly available simulated sequencing data from the ICGC-TCGA DREAM Somatic Mutation Calling Tumor Heterogeneity (SMC-Het) Challenge28 subsampled at various sequencing depths to facilitate different tests. Our extensive use of Docker containers allows seamless co-existence of multiple pipeline versions, and the combination of automated testing and containerization facilitates rapid updating with new features or dependency versions. Standardized GitHub issue templates support robust reporting of both bugs and new feature-requests, allowing ideal collaboration (Figure S1A). At writing, development has involved 43 contributors making 1,408 pull-requests and 46 individuals making 1,124 suggestions, feature-requests, and issue-reports across 17 repositories.
Bioinformatics data have high intrinsic variability, and bioinformatics software can be prone to significant numbers of failures—particularly in heterogeneous computing environments. Failure handling is built into metapipeline-DNA to predict and minimize wasted computation. We automated input and parameter validation to catch issues prior to commitment of compute resources.29 Proactive validation of pipeline parameters is implemented to avoid errors prior to resource commitment. Individual pipelines are modularized and fault-tolerant such that errors or failures in one pipeline stay isolated from and do not terminate other pipelines that are not direct dependencies. Metapipeline-DNA can be easily re-run in cases of failure, triggered starting from prior partial results with a simple parameterization.
All outputs are organized with standardized directory and naming structures (Figure S1B). Filenames have been standardized to provide dataset, organism, and sample information in a consistent way across pipelines. Metapipeline-DNA similarly organizes log-files to ensure saving of and ready access to the metapipeline-DNA logs, individual pipeline-level logs, and compute partition logs. These logs capture execution and resource usage metrics for every process. Robust tooling has been developed around process and pipeline execution to ensure logs are captured for both successful and failing steps to enable debugging and record-keeping. Scripts have been created that automatically “crawl” over a series of pipeline runs to extract and tabulate information about run success, compute resources, and other features.
Compute infrastructure
Metapipeline-DNA includes compute-agnostic customizability of execution and scheduling in distributed workflows. It has been tested and validated on both the Azure and AWS clouds. Execution follows the pattern of a single leading job responsible for submission and monitoring of per-sample or per-patient analysis jobs. Execution is currently performed with the Slurm executor with optional specification of compute partitions.30 Parameters also exist to control rate of job submission and amount of parallelization/resources usage. Once configured and submitted, metapipeline-DNA automatically handles processing of an entire cohort with input parsing and job submission without user intervention. Real-time monitoring is available through email notifications sent from a server watching individual step start, end and status. The choice of executor itself is parameterized and can be easily extended to new environments.
Metapipeline-DNA includes optimizations for disk usage, with optional eager intermediate file removal and built in checks to optimize disk usage (e.g., performing I/O operations from high-performance working disks). Resource allocation for individual steps is automatically handled, with pipelines running in parallel as available resources allows. Resource-related robustness is also built into pipelines to detect memory allocation failures from individual tools and automatically retry processes with higher allocations.
Case studies
We assessed the performance of germline SNP XY filtration using the Genome in a Bottle (GIAB) HG002 sample.31 Variant calls generated by pipeline-call-gSNP were assessed against the GIAB HG002 XY small variant benchmark v1.0 as the truth set.31 Compared with raw variant calls our XY pipeline decreases false positive INDEL calls from 1,347 to 221 (FDR from 0.054 to 0.009). Similarly, it reduces false-positive SNPs detections from 7,563 to 1,290 (FDR from 0.083 to 0.015). True positive and false negative rates were unchanged, demonstrating improved precision without reduced sensitivity (Figure 2E).
As an additional demonstration and benchmark, 10 normal-tumor pairs were processed through the entirety of metapipeline-DNA. Five pairs were selected from the Pan-Cancer Analysis of Whole Genomes (PCAWG)32 63 dataset and another five from The Cancer Genome Atlas (TCGA).33 The PCAWG-63 samples were sequenced with whole-genome sequencing and derived from multiple cancer types: one from uterine corpus endometrial carcinoma, one from biliary tract carcinoma, and three from esophageal adenocarcinoma. Tumor samples had a median coverage of 63× (range: 45–65×) and normal samples had a median coverage of 38× (range: 34–54×). The TCGA samples were derived from soft tissue sarcoma samples sequenced with exome-targeted sequencing. Both pairs were processed using metapipeline-DNA from alignment to subclonal reconstruction. PCAWG-63 samples were processed with both GRCh38 and GRCh37, with similar runtimes across the two reference builds at an average of 81.2 h (95% CI: ±14.2) for GRCh38 and 83.4 h (95% CI: ±13.0) for GRCh37. Processing was performed in Azure using Fsv2 instances with 72 CPUs and 144 GB of RAM, with a cost of 1.12 per hour. The average cost for processing the PCAWG samples was 92.18 at the reserved instance rate. The average cost for TCGA sample was 6.78 at the reserved instance rate. Across the 10 pairs, memory usage peaked in call-sSNV (average ± 95% CI: 48.5 GB ± 2.3 and 29.3 GB ± 3.8 for PCAWG63 GRCh37 and TCGA GRCh38, respectively) and in align-DNA (average ± 95% CI: 51.4 ± 5.1 for PCAWG63 GRCh38). Runtimes and peak memory usage of metapipeline-DNA are visualized in Figure 2F and summarized in Table S2. Both run-times and memory usage are a function of compute hardware and parameter selection and can be extensively tuned.
We assessed our consensus-based sSNV calling workflow and its consensus callset using the PCAWG samples with targeted deep-sequencing (mean coverage of 653×) validation as a truth set. The true positive rate (TPR) for the samples ranged from 0.87 to 0.93 with false discovery rate (FDR) ranging from 0.05 to 0.11 (Figure 2G). To demonstrate phylogenetic reconstruction, we subsampled 5,000 SNVs and used CNA calls (average 138 per sample; Figure 2H). Variant allele frequencies aggregated over all combinations of consensus calls are shown for all samples in Figure S2A.
Benchmarking analysis
Three pipelines were chosen for comparison against metapipeline-DNA: nf-core/sarek,34 Sentieon,35 and DRAGEN.36 A feature comparison covering accepted inputs, types of analyses available, execution requirements, and infrastructure requirements is detailed in Table 2.Table 2. Comparison of metapipeline-DNA with nf-core/sarek, Sentieon, and DRAGENFeatureMetapipeline-DNAnf-core/sarekSentieonDRAGENInput types acceptedFASTQ, BAM, CRAMFASTQ, BAM, CRAMFASTQ, BAM, CRAMFASTQ, BAM, CRAMTypes of variants calledSNVs, Indels, gSNPs, gSVs, sSVs, CNAs, mtSNVs, Genetic AncestrySNVs, Indels, gSNPs, gSVs, sSVs, CNAsSNVs, Indels, gSNPs, gSVs, sSVs, CNAa, repeat expansions, mtSNVsSNVs, Indels, gSNPs, gSVs, sSVs, CNAs, repeat expansions, mtSNVsSubclonal reconstructionyesnononoFollows GATK best practices?yesyesyesnoContainerization?yes, with Dockeryes, with Docker and SingularitynonoInfrastructure requiredflexible with Nextflow workflow manager, runs on CPU architectureflexible with Nextflow workflow manager, runs on CPU architectureflexible with self-contained executables, runs on CPU architectureproprietary software that requires a combination of CPU and FPGA hardwareScalabilityhighly scalable through local machines and cloud-based job schedulers for horizontal scalinghighly scalable through local machines and cloud-based job schedulers for horizontal scalingindividual software is highly optimized for vertical scalinghighly scalable vertically through hardware optimizationEase of executionsingle-command execution of cohortsingle-command execution of cohortcommand-line tools must be scripted into a workflow for executionsingle-command execution of samplesConfigurabilityselection of tools and parameters through command-line arguments and configuration fileselection of tools and parameters handled through command-line arguments and configuration filesselection of tools and parameters handled through specific calls to software and command-line argumentsselection of tools and parameters through command-line argumentsDocumentationstandardized, comprehensive documentation with usage, parameter, input, and output descriptionsstandardized, extensive, well-maintained with usage, parameter, input, and output descriptionscomprehensive documentation with usage, parameter, input, and output descriptionscomprehensive documentation and suggestions with usage, parameter, input, and output descriptionsList of genomics pipeline features and their presence and evaluation in each compared pipeline.
We performed an in-depth benchmarking between nf-core/sarek and metapipeline-DNA using the PCAWG-5 and GIAB HG002 samples. For the PCAWG-5 samples, somatic variants were called using nf-core/sarek and their performance was evaluated through comparison against a validation variant call set derived from targeted deep-sequencing of the same samples. The performance of calls from nf-core/sarek and from metapipeline-DNA is detailed in Figures S2B and S2C, respectively. Both pipelines demonstrate similar performance, with F1 scores ranging from 0.870 to 0.909 for nf-core/sarek calls and 0.853 to 0.908 for metapipeline-DNA calls. For each of the pipeline runs, the memory usage peaked at 83.8 GB for nf-core/sarek and 54.1 GB for metapipeline-DNA. Performance on germline variant calling was also compared using the HG002 sample. Germline variant calls from nf-core/sarek and metapipeline-DNA were compared to the GIAB HG002 XY small variant benchmark as the truth set, with metrics detailed in Figure S2D. Variant calls made using metapipeline-DNA show reduced FDR compared to nf-core/sarek’s calls (0.058–0.008 for INDELs and 0.086 to 0.006 for SNPs) and consistently higher F1 scores (0.943–0.969 for INDELs and 0.941 to 0.981 for SNPs).
Discussion
Metapipeline-DNA was designed to facilitate analysis of DNA sequencing data at scale while retaining the configurability and flexibility needed in academic environments. This is a key contrast to field programmable gate array (FPGA) approaches such as DRAGEN, which attain outstanding speed through hardware optimization at the expense of algorithmic flexibility and evolution. As the field of genomics evolves, the ability to quickly integrate and test emerging methods continues to be extremely important, highlighting a limitation of fixed-function hardware solutions.
Metapipeline-DNA fills this key niche of supporting the rapidly expanding volume of sequencing data, supporting a range of existing tools and algorithms and remaining flexible for ongoing expansion. By easing and optimizing the multistep analyses intrinsic to DNA sequencing data, it reduces the barrier to incorporating new methods and analyzing large datasets. Indeed, it is entirely feasible for metapipeline-DNA to leverage and incorporate FPGA-enabled and graphics processing units (GPU)-accelerated methods directly as part of its modular structure (e.g., for alignment); this is a key area of ongoing development.
Individual pipelines within metapipeline-DNA are modular, creating a plug-and-play architecture that can be adapted to support additional technologies as they become available. Algorithms and workflows for processing long-read data, for example, pose an avenue for expanding the meta-pipeline as such tools mature and long-read datasets become more common. The context of DNA also brings up the possibility of similar meta-pipelines for other biological molecules such as RNA and proteins. Workflows across different biomolecules can share the architecture, automation, and quality-control of metapipeline-DNA in a way that allows improvements to any single pipeline to improve the others. Such workflows are currently under development to provide a similar level of configurability and extensibility for analyses of RNA and protein data.
The large volume of data and size of individual samples in sequencing studies creates a need for optimization of analysis pipelines’ data handling. metapipeline-DNA contains several disk usage optimizations to efficiently handle large amounts of data while minimizing I/O operations and cross-file system data movement. The framework connecting analyses automatically identifies necessary outputs from dependent pipelines and makes it available without any redundant copying or duplication. There are additional enhancements that are underway to minimize duplicated data and disk usage of metapipeline-DNA by building plugins to enable moving of files rather than copying when possible and optimizing individual pipelines to avoid shuffling around large output files.
Metapipeline-DNA is a highly customizable DNA sequencing analysis pipeline combining speed and flexibility in a modular framework to enable processing of data at any point from read alignment to tumor subclonal reconstruction. By facilitating the integration of diverse tools and supporting the rapid development of new methodologies, it positions itself as a versatile platform for future enhancements as novel DNA sequencing and analysis methods are developed.
Limitations of the study
The on-going improvements in DNA sequencing technologies and acceleration in data volume necessitate optimization in data and resource handling and integration of novel algorithms. Analysis of sequencing data involves several distinct steps ranging from alignment of raw reads to a reference genome to variant calling to variant annotation and tumor evolution reconstruction. The compute resources required at each step vary, with alignment for example requiring high CPU and disk space usage and variant annotation requiring far fewer resources. To efficiently handle the requirements of different steps while maximizing utilization of available resources, metapipeline-DNA will employ a dynamic allocation selection algorithm trained on thousands of analyses that will leverage sequencing data features, such as read length, number of reads and read quality, to automatically determine and allocate necessary compute resources while avoiding waste.
Although SNV calling has been benchmarked against nf-core/sarek, it does not provide end-to-end accuracy and performance assessment for other analyses, including SVs and CNAs. The tools used for calling these variants have been benchmarked individually previously and are employed in metapipeline-DNA. Combined performance benchmarking for these features will be expanded in future developments.
In addition, sequencing processing often employs cohort-level studies combining information a set of individuals to identify patterns related to disease, evolution, and ancestry. To enhance metapipeline-DNA’s ability to handle cohort-level analysis, it will employ a split-apply-combine principle to process individuals in parallel before combining results joint genotyping, population ancestry background prediction and variant frequency across populations.
Resource availability
Lead contact
Further information and requests for resources and code should be directed to and will be fulfilled by the lead contact, Paul C. Boutros ([email protected]).
Materials availability
This study did not generate new unique reagents.
Data and code availability
- •The data underlying this article were accessed from the European Genome Phenome Archive (EGA; https://ega-archive.org/search/pcawg under accession number EGAS00001001692) and the Genomics Data Commons (GDC; https://portal.gdc.cancer.gov/ under dataset ID TCGA). The derived data generated in this research will be shared on reasonable request to the corresponding author.
- •Source code has been deposited here: https://doi.org/10.5281/zenodo.15085443. It is also available on GitHub (https://github.com/TheBoutrosLab/metapipeline-DNA with older versions available at https://github.com/uclahs-cds/metapipeline-DNA).
- •Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Acknowledgments
The authors gratefully acknowledge the ongoing support of all present and past members of the Boutros lab and Andrew Park in providing suggestions, practical use-cases, and support. The authors also acknowledge the Office of Health Informatics and Analytics at UCLA Health IT for their infrastructure support, particularly high-performance compute provisioning, data management, and resource tuning. This study was conducted with the support of the 10.13039/100000002National Institutes of Health through awards (R01CA268380, P30CA016042, R01CA244729, R01CA270108, U2CCA271894, U24CA248265, and U54HG012517), and of the 10.13039/100000005Department of Defense through awards (W81XWH2210247 and W81XWH2210751). N.K.W., H.K.W., J.O., R.A., and C.Z. were supported by the Jonsson Comprehensive Cancer Center Fellowship. A.E.G. was supported by the Howard Hughes Medical Institute Gilliam Fellowship. N.Z. was supported by the 10.13039/100000002National Institutes of Health through awards (T32HG002536 and F31CA281168). L.Y.L. was supported by the Canadian Institutes of Health Research Vanier Fellowship and the Ontario Graduate Scholarship. S.W. was supported by the UCLA Tumor Cell Biology Training Program through the USHHS Ruth L. Kirschstein Institutional National Research Service Award (T32CA009056). B.N. was supported by the 10.13039/100000092National Library of Medicine (T15LM013976) Training Grant and ASCO Young Investigator Award. B.L.T. was supported by the UCLA Cancer Center Support Grant (P30CA016042) and the 10.13039/100000002National Institutes of Health through awards (U2CCA271894, U24CA248265, and R01CA272678). R.H. was supported by EMBO Postdoctoral Fellowship ALTF 1131-2021 and the Prostate Cancer Foundation Young Investigator Award 22YOUN32. R.A. was supported by awards (T32GM008042 and T32GM152342). Funding for open access charge: (R01CA244729).
Author contributions
Y. Patel, conceptualization, methodology, software, visualization, and writing – original draft; C.Z., conceptualization, methodology, and software; T.N.Y., conceptualization, methodology, and software; N.K.W., validation and software; N.W., software and methodology; N.Z., software and methodology; A.E.G., software and methodology; H.K.W., software and methodology; Y. Pan, software; M.F.E.M., software, methodology, and validation; T.S., software; S.T.F-G., software, methodology, visualization, validation, and data curation; C.K., software and methodology; J.L., software and methodology; L.Y.L., software and methodology; B.C., software; A.H., software; J.O., software; J.S., software; S.T., software; S.E., software; R.H.-W., software and data curation; K.P., software; A.B., software; M.J., software; S.W., software; M.T., software and methodology; J.A., software; B.N., software; R.H., software; Y.Z.B., software; G.K., software; J.S., software; W. Zhang, software; A.A., software; E.H., software; A.N., software; P.S., software; W. Zhao, software; P.A., software; R.A., software; B.L.T., software; P.C.B., conceptualization, project administration, supervision, and writing – review and editing.
Declaration of interests
P.C.B. sits on the Scientific Advisory Boards of Intersect Diagnostics Inc. and BioSymetrics Inc., and previously sat on that of Sage Bionetworks.
STAR★Methods
Key resources table
REAGENT or RESOURCESOURCEIDENTIFIERDeposited dataPCAWG-63 whole-genome sequencing of 5 tumour-normal pairsThe ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium33EGAS00001001692TCGA exome sequencing of 5 tumour-normal pairsThe Cancer Genome Atlas Research Network32https://portal.gdc.cancer.gov/ClinVarLandrum et al.37https://www.ncbi.nlm.nih.gov/clinvar/Software and algorithmsBWA-MEM2Vasimuddin et al.38v2.2.1GATKMcKenna et al.39v4.2.4.1GATKMcKenna et al.39v3.7.0SAMtoolsLi et al.19v1.18PicardBroad Institute21v3.1.0DeepVariantPoplin et al.40v1.9.0HailHail Team41v0.2.113Hap.pyIllumina42v0.3.15ADMIXTUREAlexander and Lange43v1.3.0PLINKPurcell et al.442.00a4.5lmDelly2Rausch et al.45v1.2.6MantaChen et al.46v1.6.0mitoCallerDing et al.3v1.0.0DeepSomaticPark et al.47v1.9.0MuSE2Ji et al.48v2.0.4SomaticSniperLarson et al.49v1.0.5.0Strelka2Kim et al.50v2.9.10BCFtoolsDanecek et al.51v1.17BPGP’ng et al.52v7.1.0VennDiagramChen and Boutros53v1.7.4SVisionLin et al.54v1.4circlizeGu et al.55v0.4.16CNV_FACETSShen and Seshan56v0.16.0BattenbergNik-Zainal et al.57v2.2.9PyClone-VIGillis and Roth58v0.1.2PhyloWGSDeshwar et al.59v2205be1FastCloneXiao et al.60v1.0.9StableLiftWang et al.24v1.0.0SnpEffCingolani et al.61v5.1dVEPMcLaren et al.62v101.0CEVWinata et al.63v2.0.0PipeValPatel et al.29v5.1.0NextflowDi Tommaso et al.13v23.04.2metapipeline-DNAThis paperDeposited at https://doi.org/10.5281/zenodo.15085443 and at https://github.com/TheBoutrosLab/metapipeline-DNA; and at https://github.com/uclahs-cds/metapipeline-DNA for older versions
Method details
Analysis cohort
To demonstrate the use of metapipeline-DNA, we chose ten normal-tumour pairs. Five were WGS tumour-normal pairs from PCAWG-63: one from uterine corpus endometrial carcinoma donor DO43506, one from biliary tract carcinoma donor DO218695 and three from esophageal adenocarcinoma donors DO50342, DO50407 and DO50311. Five were exome sequencing pairs of soft tissue sarcoma pairs from TCGA donors TCGA-QQ-A8VD, TCGA-X6-A8C6, TCGA-HS-A5N8, TCGA-DX-A1L2 and TCGA-HB-A2OT.32^,^33
Alignment and variant calling
Sequencing reads were aligned to the GRCh38.p7 reference build including decoy contigs from GATK using BWA-MEM238 (v2.2.1) in paired-end mode followed by duplicate marking with MarkDuplicatesSpark using GATK39 (v4.2.4.1). For the GRCh37 runtime benchmarking, alignment was performed to the GRCh37 reference build including decoy contigs. The resulting alignments were recalibrated through Indel realignment using GATK (v3.7.0) and base-quality score recalibration using GATK (v4.2.4.1). Quality metrics were generated using SAMtools19 (v1.18) stats and Picard21 (v3.1.0) CollectWgsMetrics. Germline SNPs were called using DeepVariant40 (v1.9.0) and HaplotypeCaller from GATK (v4.2.4.1) followed by variant recalibration using GATK (v4.2.4.1). Germline SNPs underwent XY filtration using Hail41 (v0.2.113) with benchmarking assessment performed using Hap.py42 (v0.3.15). Genetic ancestry was called using germline SNPs and INDELs as inputs using ADMIXTURE43 (v1.3.0) and PLINK44^,^64 (2.00a4.5lm). Germline SVs were called using Delly245 (v1.2.6) and Manta46 (v1.6.0). Mitochondrial SNVs were called using mitoCaller3 (v1.0.0). Somatic SNVs were called using DeepSomatic47 (v1.9.0), MuSE248 (v2.0.4), SomaticSniper49 (v1.0.5.0), Strelka250 (v2.9.10) and Mutect239 (v4.5.0.0) followed by a consensus workflow to identify variants called by two or more callers using BCFtools51 (v1.17) with quality control plots generated with BPG52 (v7.1.0) and VennDiagram53 (v1.7.4). Somatic SVs were called using SVision54 (v1.4), Delly245 (v1.2.6) and Manta46 (v1.6.0) and visualized with circlize55 (v0.4.16). Somatic CNAs were called using CNV_FACETS56 (v0.16.0) for the PCAWG sample and using Battenberg57 (v2.2.9) for the TCGA sample with visualization generated using BPG52 (v7.1.0). Taking the consensus set of somatic SNV calls and the CNA calls, subclonal reconstruction was performed using PyClone-VI58 (v0.1.2), PhyloWGS59 (v2205be1) and FastClone60 (v1.0.9). Variant liftover was performed using BCFtools51 (v1.20) and stability prediction using StableLift24 (v1.0.0). Variant annotation was done using SnpEff61 (v5.1d), Funcotator39 (v4.2.4.1) and VEP62 (v101.0) and ClinVar37 (v20211016). Reconstructed phylogeny was visualized using CEV63 (v2.0.0). Data validation was performed with PipeVal29 (v5.1.0) and data processing was done using Nextflow13 (v23.04.2).
Benchmarking analysis
Three variant-calling pipelines were assessed against metapipeline-DNA: nf-core/sarek,34 Sentieon35 and DRAGEN.36 Qualitative assessment of pipeline features was performed with all pipelines with quantitative benchmarking focused against nf-core/sarek. Somatic variant calling assessment compared variant calls made by nf-core/sarek and metapipeline-DNA with a deep sequencing validation callset. For each pipeline, F1 scores were calculated based on variant calls made and resource usage was compared using the peak memory usage. Germline variant calling was assessed by comparison with an available benchmarking truth set of calls. For both pipelines, F1 scores and false discovery rates were calculated for the INDEL and SNP categories of germline variants.
Quantification and statistical analysis
Quantification of coverage was performed with two metrics. The percent of based exhibiting at least a given coverage depth were plotted from depths of 0× to 100×. Cohort-level coverage summaries were generated through plotting distributions of mean and median coverage values per sample. For assessment of consensus somatic SNV calls, VAFs across the variant callers were adjusted with a weighted average for each combination of 1, 2, 3 and 4 variant callers based on selection of variant sets called by the respective number of callers. Performance assessment of variant calling was done for each category of calls (somatic SNVs, germline INDELs, germline SNPs) using true positives (TP), false negatives (FN), false positives (FP), F1 score, false discovery rate (FDR) and sensitivity. Computational performance was quantified through comparison of pipeline runtime in real time and of peak memory usage.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Shendure J.Balasubramanian S.Church G.M.Gilbert W.Rogers J.Schloss J.A.Waterston R.H.DNA sequencing at 40: past, present and future Nature 55020173453532901998510.1038/nature 24286 · doi ↗ · pubmed ↗
- 2Logsdon G.A.Vollger M.R.Eichler E.E.Long-read human genome sequencing and its applications Nat. Rev. Genet.2120205976143250407810.1038/s 41576-020-0236-x PMC 7877196 · doi ↗ · pubmed ↗
- 3Ding J.Sidore C.Butler T.J.Wing M.K.Qian Y.Meirelles O.Busonero F.Assessing mitochondrial DNA variation and copy number in lymphocytes of 2,000 Sardinians using tailored sequencing analysis tools P Lo S Genet.112015 e 100530610.1371/journal.pgen.1005306 PMC 450184526172475 · doi ↗ · pubmed ↗
- 4Zhang Y.Liu X.Wiggins K.L.Kurniansyah N.Guo X.Rodrigue A.L.Zhao W.Yanek L.R.Ratliff S.M.Pitsillides A.Association of Mitochondrial DNA Copy Number With Brain MRI Markers and Cognitive Function: A Meta-analysis of Community-Based Cohorts Neurology 10020231930194310.1212/WNL.0000000000207157 PMC 1015977036927883 · doi ↗ · pubmed ↗
- 5Gauthier J.Vincent A.T.Charette S.J.Derome N.A brief history of bioinformatics Brief. Bioinform.202019198119963008494010.1093/bib/bby 063 · doi ↗ · pubmed ↗
- 6Puttick C.Jones T.P.Leung M.M.Galvez-Cancino F.Liu J.Varas-Godoy M.Rowan A.Pich O.Martinez-Ruiz C.Bentham R.MHC Hammer reveals genetic and non-genetic HLA disruption in cancer evolution Nat. Genet.562024212121313935860110.1038/s 41588-024-01883-8PMC 11525181 · doi ↗ · pubmed ↗
- 7Chakravarty D.Solit D.B.Clinical cancer genomic profiling Nat. Rev. Genet.2220214835013376273810.1038/s 41576-021-00338-8 · doi ↗ · pubmed ↗
- 8Sosinsky A.Ambrose J.Cross W.Turnbull C.Henderson S.Jones L.Hamblin A.Arumugam P.Chan G.Chubb D.Insights for precision oncology from the integration of genomic and clinical data of 13,880 tumors from the 100,000 Genomes Cancer Programme Nat. Med.3020242792893820025510.1038/s 41591-023-02682-0PMC 10803271 · doi ↗ · pubmed ↗
