An nf-core framework for the systematic comparison of alternative modeling tools: the multiple sequence alignment case study
Luisa Santus, Jose Espinosa-Carrasco, Leon Rauschning, Júlia Mir-Pedrol, Igor Trujnara, Alessio Vignoli, Leila Mansouri, Athanasios Baltzis, Evan W Floden, Paolo Di Tommaso, Edgar Garriga, Adam Gudyś, Sebastian Deorowicz, Cameron Gilchrist, Martin Steinegger, Cedric Notredame

TL;DR
This paper introduces a framework to compare and deploy multiple sequence alignment tools efficiently in high-performance computing environments.
Contribution
A novel nf-core framework is introduced for systematic evaluation and deployment of MSA tools in HPC settings.
Findings
The framework enables streamlined deployment and comparison of multiple sequence alignment tools.
It serves as a proof of concept for broader bioinformatics tool evaluation.
The framework is open-source and available for immediate use by the MSA community.
Abstract
The computational complexity of many key bioinformatics problems has resulted in numerous alternative heuristic solutions, where no single approach consistently outperforms all others. This creates difficulties for users trying to identify the most suitable tool for their dataset and for developers managing and evaluating alternative methods. As data volumes grow, deploying these methods becomes increasingly difficult, highlighting the need for standardized frameworks for seamless tool deployment and comparison in high-performance computing (HPC) environments. Multiple sequence aligners (MSAs) rank among the most commonly employed modeling techniques in bioinformatics, playing a crucial role in applications such as protein structure prediction, phylogenetic reconstruction, and variant effect prediction. MSAs are NP-hard problems, which makes them a major example of computational…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3| Accepts input guide tree | Uses structural information | Reference | |
|---|---|---|---|
| ClustalO | Yes | No | [ |
| FAMSA | Yes | No | [ |
| FoldMason | Yes | Yes | [ |
| Kalign3 | Yes | No | [ |
| LearnMSA | No | No | [ |
| MAFFT | Yes | No | [ |
| mTM-Align | No | Yes | [ |
| MAGUS | Yes | No | [ |
| Muscle5 | No | No | [ |
| Regressive | Yes | No | [ |
| T-Coffee | Yes | No | [ |
| UPP2 | Yes | No | [ |
| 3D-Coffee | Yes | Yes | [ |
| Guide tree | Distance measure | Core algorithm | Speed-up heuristic |
|---|---|---|---|
| MAFFT |
| UPGMA + SL combined | |
| MAFFT --minimumlinkage |
| SL | |
| MAFFT --averagelinkage |
| UPGMA | |
| MAFFT --parttree |
| SL + UPGMA combined | PartTree |
| MAFFT --dpparttree | Dynamic programming alignment-based | SL + UPGMA combined | PartTree |
| MAFFT --fastaparttree | FASTA alignment-based | SL + UPGMA combined | PartTree |
| CLUSTALO | Sequence embedding + approximate alignment | UPGMA | Bisecting k-means |
| FAMSA | LCS-based | SL | |
| FAMSA -gt upgma | LCS-based | UPGMA | |
| FAMSA -gt nj | LCS-based | Neighbor joining | |
| FAMSA -parttree | LCS-based | SL | PartTree |
| FAMSA -gt upgma -parttree | LCS-based | UPGMA | PartTree |
| FAMSA -medoidtree | LCS-based | SL | MedoidTree |
| FAMSA -gt upgma -medoidtree | LCS-based | UPGMA | MedoidTree |
| Method | Reference-based | Structure-based | Reference |
|---|---|---|---|
| Sum-of-pairs (SP) | Yes | No | [ |
| Total column (TC) | Yes | No | [ |
| TCS | No | No | [ |
| Number of gaps | No | No | Custom script |
| NiRMSD | No | Yes | [ |
- —Spanish Ministry of Science and Innovation10.13039/501100004837
- —AEI10.13039/100005910
- —FSE+10.13039/501100004895
- —Centro de Excelencia Severo Ochoa
- —Generalitat de Catalunya10.13039/501100002809
- —National Science Centre, Poland10.13039/501100004442
- —National Research Foundation of Korea10.13039/501100003725
- —Seoul National University10.13039/501100002551
- —German Academic Scholarship Foundation
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Phylogenetic Studies · Algorithms and Data Compression · Machine Learning in Bioinformatics
Introduction
The massive generation of biological data over the last decades has resulted in an increasing reliance on fast, approximate heuristic algorithms in bioinformatics tools. Multiple sequence alignment (MSA) is an example of a problem where alternative heuristic methods are available [1]. Yet, this methodological configuration is far from being an isolated case. Indeed, data growth has been such that even for tasks having exact solutions, such as database search [2], faster heuristic solutions were developed, like BLAST [3] or MMseqs [4, 5]. Having more than one tool to carry out the same task is a mixed blessing. For instance, systematic benchmarks of different MSA algorithms have consistently revealed uneven performances across alternative reference datasets [1]. A similar situation was reported for read mapping [6] or genome assembly methods [7]. This situation brings challenges for both users and developers. Users face the complex task of navigating a diverse array of tools and having to select the most suitable option, considering factors such as accuracy and speed. Meanwhile, developers who investigate and publish novel methods must deploy extensive comparisons to the state of the art.
MSA modeling is a prime example of a problem where the computational complexity has forced pervasive reliance on approximations. MSAs are among the most widely utilized modeling techniques in bioinformatics as they provide the means to compare and analyze genetic sequences, unveiling evolutionary relationships and conserved regions [8]. This capacity is fundamental for various biological applications, including protein structure prediction, functional annotation, protein families classification, evolutionary modeling, and the understanding of genetic variation [9–12]. The forecast output of large-scale sequencing projects, such as the Earth BioGenome Project [13], will present the MSA field with major scale-up challenges requiring improvements well beyond the current state of the art [1].
While structure-based approaches to MSA have hitherto been limited by the availability of experimentally determined protein structures, recent progress in protein structure prediction [11, 14–16] raises the possibility of utilizing structural information in MSA at a large scale. This additional source of information is especially useful in highly divergent sequences, where the degree of conservation of measured or predicted protein folds can allow more robust alignments than sequence similarity alone [17, 18], but presents computational and deployment challenges. Recent work on using structurally augmented alphabets to improve the scalability of structural MSA [19, 20] confirms the fate of MSA methods as an -omics data integrative procedure. Consequently, a key objective is to provide the community with a generic and flexible deployment platform that integrates and compares all these methods.
Establishing such a unified framework brings several practical challenges. The most obvious issue is that the requirements of the new algorithms can be very diverse. For instance, they may rely on CPU, GPU, or a combination of both, require a local copy or cloud-based access to a large data corpus, and have complex chains of dependencies on libraries and tools. The community has long recognized these problems and it is now considered good practice to deploy scientific computation using workflow management systems [21], offering efficient task orchestration and resource management. Nextflow [22] has become one of the most popular of these solutions [23]. Nextflow is used by the nf-core community [24], aimed at developing and supporting standardized pipelines to cover all aspects of computational biology. The nf-core ecosystem provides a highly suitable environment for incorporating novel computational developments while ensuring they remain connected to the large corpus of existing methods.
In this application note, we introduce nf-core/multiplesequencealign: a framework designed to facilitate MSA deployment while providing rigorous performance evaluation and benchmark reporting. Harnessing Nextflow’s DSL2 extension [23], the pipeline’s modular design enables the easy integration of newly developed methods. In this first installment, we have integrated all the methods used for a recent large-scale benchmark [1]: ClustalO [25], FAMSA [26], Kalign3 [27], LearnMSA [28], MAFFT [29], MAGUS [30], Muscle5 [31], Regressive [32], and UPP2 [33], as well as other widely used tools such as T-Coffee [34] and structure-based aligners: FoldMason [19], mTM-Align [35], and 3D-Coffee [18]. The large-scale capacities of most of these components were previously benchmarked [1]. Yet nf-core/multiplesequencealign is as much a functional pipeline as a seed for an MSA ecosystem in which all members of the community should feel invited to integrate their own MSA methods. Its documentation has therefore been developed for both tool users and developers. This framework should also be considered a pilot for other bioinformatics problems that require the deployment and comparison of alternative modeling methods, such as protein structure prediction, differential expression analysis, and other computationally complex tasks.
Materials and methods
Implementation
nf-core/multiplesequencealign is written in Nextflow [22] DSL2 syntax [23] and is part of the “nf-core” collection of bioinformatics workflows [24]. The modular nature of DSL2 extension is crucial to allow (i) the reusability of the individual modules as well as (ii) the integration of new components into the pipeline, such as new MSA algorithms or evaluation techniques (see the “Integration of new methods” section). As required by the nf-core guidelines, each of the modules is provided with a stable software environment via either the conda package manager or software containers such as Docker and Singularity. The implementation in Nextflow combined with the usage of software containers ensures the reproducibility and portability of nf-core/multiplesequencealign across various computing environments (local computers, HPC clusters, and different cloud providers). A contained test dataset is used for continuous integration testing and provided at https://github.com/nf-core/test-datasets/tree/multiplesequencealign. A full-size dataset is provided at s3://ngi-igenomes/testdata/nf-core/pipelines/multiplesequencealign/1.1.1 and is used to test the pipeline on AWS. The pipeline also comes with extensive documentation of its parameters and usage to be found at https://nf-co.re/multiplesequencealign.
Input
The minimal input for the pipeline consists of either a FASTA file or a set of protein structures, in addition to the specification of which MSA tool to use. The input can be provided in two alternative ways: (i) The parameters --seqs and --pdbs_dir can be used to provide the FASTA file and/or the set of protein structures, respectively, --tree, --args_tree,--aligner, and --args_aligner to provide the directives of which combination of guide tree and aligner to use. This will only allow deploying one dataset using one method for each pipeline run. (ii) Alternatively, to deploy the pipeline on multiple samples and multiple tools in parallel, the pipeline’s input must consist of two CSV files. The one received by the parameter --input contains the information about the input files, such as the sequences to be aligned, any accessory information to the sequences (e.g. protein structures), and reference alignments to be used for benchmarking. For basic MSA deployment, either the protein sequences or structures are sufficient input, depending on the MSA tools specified. The remaining inputs are optionally used by other modes of the pipeline, such as benchmarking. The second CSV received by the parameter --tools contains instructions about which of the tools are to be deployed. Each row specifies one combination of tools, specifically of a guide tree and an assembly method, and the specific combination of flags for both the guide tree and the assembly. A more extensive description of how to build each of the input files can be found in the pipeline’s documentation.
Integration of new methods
An essential property of nf-core/multiplesequencealign is its extensible, modular structure, which allows the incorporation of new methods as they get developed by the community. Adding a new MSA method to the pipeline involves just two steps. First, one needs to provide the tool as a Nextflow module. Ideally, this should be formatted as an nf-core module (https://nf-co.re/modules/) to align with community standards for interoperability and reproducibility. The module’s output is expected to conform to the established standards within its category, e.g. the module of an MSA tool should output an MSA in FASTA format. It is advisable to use existing modules from the same category as templates. These can be found directly in nf-core/modules (https://nf-co.re/modules/) as well as directly installed in the pipeline (https://github.com/nf-core/multiplesequencealign/tree/master/modules/nf-core). Then, the relevant subworkflow has to be updated to incorporate the new module correctly, using the documentation at https://nf-co.re/multiplesequencealign/docs/usage/adding_a_tool.html and examples from existing tools as a guide. Further instructions for integrating additional modules can be found in the pipeline’s documentation.
Results
nf-core/multiplesequencealign deploys and compares various MSA tools: an overview of the pipeline
The pipeline is explicitly designed to deal with protein sequences. As such, it can receive sequences, structures, or a combination of both as input, depending on the requirements of the respective tools (see the “Input” section). The nf-core/multiplesequencealign pipeline consists of six main steps (Fig. 1). In the first step, input files are collected, validated, and, optionally, summary statistics are computed (number of sequences to be aligned, sequence length, pairwise sequence similarity). Extra metrics based on structural inputs can also be extracted, such as the pLDDT measure for predicted protein structure quality. The second step involves the computation of the guide tree that may be required by the MSA assembly carried out in step #3. This layout reflects the algorithmic nature of most MSA packages that are based on the progressive alignment algorithm [36] in which sequences are first clustered into a binary guide tree and subsequently progressively incorporated into the final MSA following the order indicated by this tree. In nf-core/multiplesequencealign, we have explicitly separated steps #2 and #3 to provide a framework that supports the systematic exploration of combinations between guide tree and MSA assembly algorithms. Step #4 is facultative and estimates a consensus model from the computed MSAs using M-Coffee [37]. Step #5 evaluates the computed alignments with multiple commonly used quality metrics (see the “MSA evaluation” section). The final step (#6) assembles the input statistics and alignment evaluation metrics into a final report. Step #6 optionally also provides the visualization of the rendered MSAs and the corresponding structures using FoldMason [19]. A key feature of the pipeline, stemming from its Nextflow implementation, is that multiple tools, as well as multiple combinations of tools and parameters, can be run in parallel. Provided some external quality control is available, such as reference alignments or quality metrics, users can run the pipeline as a benchmark analysis and select the combination of tools most likely to perform well on their production datasets.
Visual outline of the nf-core/multiplesequencealign pipeline (v1.1.1). First, optionally, the input files are characterized via a collection of summary statistics. Optionally, a guide tree is rendered. Then, the MSA is computed. Finally, various metrics can optionally be calculated to evaluate the tools’ performance. Eventually, these results are summarized in a visual form in MultiQC and a custom shiny app.
This pipeline is therefore a dual tool: (i) a versatile production pipeline allowing the processing of any dataset with a user-defined methodology and (ii) a benchmark allowing the systematic comparison of alternative MSA procedures and subsequent parameterization of an optimal production pipeline. These two alternative ways of deploying the pipeline are made possible by the toolsheet feature that defines which MSA procedures will be applied to each dataset.
The toolsheet enables parallel comparison of tools and parameters’ configurations
MSAs, like most bioinformatics tools, rely on complex parameter configurations. Although all possible command line options are part of the same package, in practice, each unique combination of parameters may function as a distinct tool, and sometimes even deploy entirely different algorithms. For instance, in T-Coffee, the default algorithm uses the original consistency-based T-Coffee algorithm [34], while the -reg flag triggers a completely different large-scale alignment mode [32]. Other popular tools, such as MAFFT, support an even broader range of algorithms [29]. This complexity makes parameterization a feature as important to trace and maintain as the software name and version.
On top of this, each combination of computing elements (e.g. guide tree and assembly) defines a procedure with distinct properties. Each of these procedures is considered separately for benchmarking, as it holds specific properties. The pipeline must account for this complexity, make it traceable, and allow users to unambiguously identify the effect of each component when benchmarking. To guarantee this, in nf-core/multiplesequencealign a pipeline benchmarking execution can be specified by two input directives: the samplesheet defining the input file (containing the protein sequences, structures, reference alignment, etc.), as used in most nf-core pipelines, and a toolsheet, which declares the tools that will be used and compared. Each line in the toolsheet is considered a distinct procedure in the benchmark and the same tool may appear on several lines with different parameters. Information provided by the toolsheet includes the guide tree method (if applicable), the assembly method, and the command line parameters. Each combination of tools defined in the toolsheet is then computed for each sample specified in the samplesheet. This process can be seen as a Cartesian product of the sample and toolsheet. This results in a collection of MSAs ready to be compared with one another or with reference datasets. In a production use case where a single output is desired, the possibility exists to limit the toolsheet to a single tool only or to use the --tree, --args_tree, --aligner, and --args_aligner parameters (see the “Materials and methods” section).
Guide tree computation
Guide trees, which are a hierarchical clustering of input sequences, define the order in which sequences are aligned. They are a prerequisite for most progressive MSA methods [1]. Guide trees have been shown to have a major impact on MSA accuracy [32]. Their computation can be a bottleneck in large datasets as they typically rely on a distance matrix of all sequences to be calculated, requiring O(N^2^) operations. As of version 1.1.1, the pipeline includes FAMSA [26] to support the stand-alone computation of the most common types of guide tree methods, specifically neighbor joining [38], the unweighted pair group method with arithmetic mean (UPGMA) [39], single linkage (SL), and PartTree [40] using either SL or UPGMA. Multiple flavors of the PartTree heuristic can also be deployed using MAFFT [40], reported in detail in Table 2. The pipeline also supports the ClustalO [25] implementation of mBed trees, which are based on a combination of k-means clustering and sequence embeddings [41]. The guide tree computation step has been designed modularly to allow the easy incorporation of new guide tree methods that users may be interested in.
MSA assembly
Assembly refers to the step where the final MSA is constructed by combining all the sequences, a procedure common to all MSA tools, regardless of their dependence on a guide tree [1, 42]. The pipeline integrates some of the most widely used large-scale assembly methods (listed in Table 1), and like the other sections of the pipeline, it has been specifically designed in a modular way to support the integration of novel tools as they get developed (see the “Integration of new methods” section).
In the current implementation, iterative algorithms occasionally employed for MSA computation are carried out within the framework of their dedicated software, even when these iterations involve re-estimating a guide tree [25, 43].
Generation of a consensus MSA
Dealing with several alternative models estimated on the same data presents a significant challenge, especially when no single objective measure is available to unequivocally decide which model should be preferred over the others. Consensus MSAs provide a convenient way to address this issue as they allow the information content of several MSAs to be merged into a single model. This integration is especially useful for identifying highly supported regions in multiple MSAs. These have been shown to match the structural similarity of proteins more closely than less supported regions [37]. The current version of the pipeline supports the generation of consensus alignments using the M-Coffee protocol. Under its formulation, all pairwise projections are extracted from the initial MSAs and used to populate a position-specific scoring scheme based on consistency. This scheme is then used to estimate a progressive MSA. This MSA constitutes an empirical consensus that has been shown to have a high average sum-of-pairs score, comparable to the scores of the individual MSAs. In other words, the consensus MSA effectively captures the shared characteristics of the constituent alignments while preserving a level of quality similar to that of the original inputs. The consensus MSA also comes with several global and local metrics, allowing an estimate of the global agreement between the alternative alignments.
MSA evaluation
A significant hurdle in the development of novel MSA algorithms lies in the intricacy of their evaluation. For a long time, the lack of references resulted in the MSA problem being defined in terms of string matching. Under this formulation, an MSA is optimal if it maximizes the alignment score between the projected alignment of every pair of sequences. This formulation is known as the sum of pairs (SoP). In practice, the SoP formulation constitutes, more or less explicitly, the objective function of most MSA algorithms. The SoP is, however, known to be limited in its capacity to produce structurally optimal MSAs [44]. The question of how to evaluate MSA accuracy therefore remains partly open, especially considering the lack of an absolute, formal definition of MSA correctness. Indeed, the final evaluation in practice will often depend on domain-specific criteria that may arbitrarily incorporate elements of evolutionary, structural, or functional perspectives on correctness.
In biological terms, an MSA may be described as the product of an evolutionary process and should, as such, precisely reflect the unique evolutionary history of the observed sequences. To be of any use, this formulation requires prior knowledge of the correct underlying phylogeny, an information usually not available. However, evolutionary correctness is only one perspective when evaluating MSAs.
Often, MSAs are used to model structural similarity among homologous sequences [45], in which case correctness is defined as a geometric problem. While structure- and evolution-based criteria for evaluating MSAs tend to have some level of agreement, their equivalence remains disputed [43, 46, 47]. In practice, the most commonly used validation method has been the comparison of sequence-based MSAs with reference structure-based counterparts. These comparisons can be carried out using either the SoP between the two MSAs or the comparison of entire columns (total column, TC). However, reference datasets of structure-based MSAs are not entirely straightforward to establish, as the computation of structure-based MSAs is also an NP-hard problem. This extra hurdle has resulted in the establishment of several alternative reference MSA collections [48–50]. Alternative alignment-free methods were also reported, based on the comparison of structural properties [19, 51–55].
The current implementation of the pipeline features the possibility to do an SoP and a TC comparison between the MSA produced by any procedure on any input sample and a user-provided sample-specific reference (specified in the “reference” column of the samplesheet), using these metrics as implemented in the T-Coffee package [34]. The pipeline also supports two alignment-free metrics, the TCS score that estimates the agreement between an MSA and the optimal alignments associated with its pairwise projections [46], and the normalized intramolecular root mean square deviation (NiRMSD), if at least two protein structures are provided for the input sample [51]. The NiRMSD quantifies the structural agreement of the aligned sequences. The pipeline can also provide simple summary metrics, including the count of gaps in the alignment (Table 3).
We expect the effective evaluation of MSAs to remain a very active field of investigation, especially under the pressure of deep-learning-based methods, now routinely used to scan MSAs for pathogenic mutations [9] and the development of new structural aligners in light of the wide availability of predicted protein structures [19]. For this reason, the implementation of the evaluation section has also been explicitly designed to facilitate the integration of novel evaluation metrics.
As datasets grow, feasible resource requirements are crucial to maintaining the practicality of these tools. Resource utilization, including CPU usage, memory consumption, and runtime, is therefore a crucial factor in tool evaluation. To address these needs, nf-core/multiplesequencealign leverages Nextflow’s native tracing and the nf-co2footprint plugin to generate a final report that includes detailed resource metrics. The computational requirements for guide tree computation and assembly are extracted from the general resource usage report and merged into the final CSV report alongside other quality metrics, such as SoP and TC scores, to facilitate a more comprehensive evaluation.
Summary report
The pipeline offers an optional summary report detailing the metrics calculated on the input datasets and the evaluation of the resulting alignments using MultiQC [56] (Fig. 2). Additionally, if structural data is provided, Foldmason [19] enables user-friendly visualization of the rendered MSAs along with their corresponding structures. Examples of these visualizations, along with all other pipeline outputs, are available in the publicly accessible nf-core AWS results (see the “Data availability” section). nf-core/multiplesequencealign also provides a shiny-based browser interface to explore the results, as illustrated in Fig. 3. For exact instructions on how to start the shiny app please refer to the usage documentation (https://nf-co.re/multiplesequencealign/usage). A standardized way to systematically evaluate the impact of certain input features, captured by the summary statistics reported by nf-core/multiplesequencealign, can provide MSA tool developers with a deeper understanding of when MSA tools perform well and when they fail. This can help optimize tools for specific situations or even inspire the development of new tools exploiting different kinds of signals in the data.
MultiQC summary table. A single dataset (column fasta) was used to test a variety of tools (combination of columns tree and aligner). The remaining columns show summary statistics that characterize the input files (n_sequences, seqlength_mean, perc_sim) and the tool performance evaluation (sp, tc, NiRMSD, plddt).
Screenshot of the shiny app of the pipeline. A shiny app is created to explore the results of the analysis. All metrics (summary statistics about the input files, type of tools and parameters deployed, evaluation metrics, time, and memory requirements) are available for interactive exploration.
Pipeline chaining
A key design principle within the nf-core community indicates that each pipeline should focus on a particular type of analysis. For instance, pipelines may be dedicated to RNA-Seq (nf-core/rnaseq), variant calling (nf-core/sarek) [57], or protein folding (nf-core/proteinfold) [58]. However, it is extremely common for the output of one pipeline to serve as the input for another. For instance, the nf-core/proteinfold pipeline accepts one or more FASTA files and utilizes various protein structure prediction tools, such as AlphaFold2 or ColabFold, to generate predicted structures. The obtained protein structures can be an input for the nf-core/multiplesequencealign workflow. Therefore, integrating the results of nf-core/proteinfold into nf-core/multiplesequencealign is bound to be a common use case. nf-core/multiplesequencealign supports this integration, as detailed in the documentation.
Discussion
Despite its central role in biological analysis, the effective deployment of MSA computation remains hampered by rapidly growing genomics datasets, the high computational complexity of available algorithms, and the lack of deployment and documentation standards. On top of this, the wide range of downstream uses of MSAs, spanning from orthology detection, protein family classification, and functional genomics to phylogeny reconstruction and protein structure prediction, each brings its own challenges and requirements. Combined with the lack of clear, unequivocal solutions, this has resulted in an extremely fragmented software landscape. The sequence alignment Wikipedia page lists 52 distinct MSA packages (https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Multiple_sequence_ alignment) developed over >35 years. The packaging, usage, and output formats of this software are not standardized in any way and remain extremely diverse. This makes general usage and thorough comparisons difficult: except for a recent review on large-scale alignments this work is based upon, we are not aware of any comparative study including >10 MSA packages. Yet, the most ancient of these solutions were developed at a time when pressure on hardware, especially concerning memory, was stronger than it is today. It cannot be ruled out that older algorithms may contain the seed of future solutions to the current scale-up problem. Maintaining the memory of the field is, therefore, an issue of prime practical interest. Achieving this goal will require the community to come together to develop an ecosystem in which all the relevant methods can be dynamically maintained and systematically compared.
nf-core/multiplesequencealign paves the way toward the establishment of an exhaustive and possibly automatically updated reference benchmark workflow for MSA tools. The requirement of live standardized benchmarks is crucial in the case of complex computational processes that combine various heuristics. Often enough, these algorithms are iteratively improved by their authors. Yet, whenever an updated algorithm is released, it is down to the data analysis team to establish whether improvements carried out independently across combined algorithms are additive, synergistic, or antagonistic. It is also down to the users to determine a suitable trade-off between the energetic cost of their computation and the resulting added value. In practice, one may expect this assessment to be domain-dependent or even test-case-dependent. Facilitating benchmarking is therefore a major goal for the long-term sustainability of computational biology. We expect that the establishment of live benchmarking will be needed well beyond MSA, across all domains relying on fast-evolving algorithms.
While our framework does not directly address the challenges associated with the lack of clear and unbiased benchmarking datasets for MSAs, it offers an approach that is both dataset-agnostic and inclusive of multiple evaluation methods deemed of interest by different communities of users. This is particularly important in light of the myriad downstream uses of MSAs that have led to the development of a large number of accuracy metrics over the years. Establishing the relative merits of these alternative evaluations has been proven difficult and it is common practice to report readouts on two or more reference datasets or stand-alone metrics. Once populated with all the relevant evaluation methods, our pipeline should allow for more systematic comparisons. More importantly, it will allow users to identify which metrics best reflect their needs and choose their alignment strategy accordingly. Since the metric defines the biological optimality of an MSA, its selection is as important as the algorithmic optimization capacities. Yet these two levels of optimization, the computational and the biological, have so far remained largely disconnected. The system we propose aims to bridge this gap by enabling exploration of the dependencies between optimization and evaluation.
The pipeline is designed to fully support template-based sequence alignments, a concept first introduced by 3D-Coffee nearly two decades ago. Template-based alignments rely on the concept that the alignment of a protein sequence may be enhanced by adding an additional layer of information. The template is a re-encoding of the sequence in which specific features are collected that lend themselves to the establishment of an alignment without using the original sequence information. Structural data is the most commonly used kind of information that can be associated with sequence data, which the pipeline can already use as a template. Over the years, other types of templates have been developed, such as RNA secondary structure-based [59] and evolutionary [60] templates. In theory, any data that can be mapped onto DNA, RNA, or protein sequences is suitable to become a template. For instance, the notion of functional templates using functional genomic information associated with sequences remains to be explored. It could provide important developments in DNA analysis, thanks to the wide availability of DNA functional data collected by projects like ENCODE or via prediction tools [61]. The consistency-based approach of T-Coffee in which templates were originally developed is especially suitable for this kind of integration. We anticipate that the use of additional information in MSA will become standard practice, especially for the alignment of regulatory regions, turning MSA into a multi-omics problem. We expect an extension of the pipeline to allow for these analyses to be an important future milestone.
The maintenance of a common hub for MSA methods is a long-term commitment. Because of the inherent computational complexity of the problem, MSA methods are bound to keep evolving, following the information content, scale, and downstream applications of sequences generated by biologists. By providing a common roof to a large number of methods, we expect that this pipeline will allow the gradual deconstruction of existing methods into smaller, modular algorithmic components. Indeed, and as reflected in the current layout of the pipeline, MSA algorithms consist of multiple, independent steps. Yet, historically, most packages have been distributed as self-contained units. This approach has often led to the reimplementation of existing components, tightly integrated at the core of each package. In the case of MSAs, some sub-procedures may be for instance the computation of the distances used for some guide tree computation, the guide tree computation itself, and the assembly. At times, this process makes it difficult to identify which component drives performance improvement. Addressing this problem will require systematically deconstructing every method into a collection of modules, each implementing a step of the original algorithm. The basic principle of Nextflow and their implementation in nf-core/multiplesequencealign provide an ideal framework for such deconstruction. This process requires a wide effort by the method development community, yet the modular nature of Nextflow allows it to be done in a progressive and decentralized manner by leveraging the power and resources provided by the nf-core community and its many members.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Santus L, Garriga E, Deorowicz S et al. Towards the accurate alignment of over a million protein sequences: current state of the art. Curr Opin Struct Biol. 2023; 80:10257710.1016/j.sbi.2023.102577.37012200 · doi ↗ · pubmed ↗
- 2Smith TF, Waterman MS Identification of common molecular subsequences. J Mol Biol. 1981; 147:195–7.10.1016/0022-2836(81)90087-5.7265238 · doi ↗ · pubmed ↗
- 3Altschul SF, Gish W, Miller W et al. Basic local alignment search tool. J Mol Biol. 1990; 215:403–10.10.1016/S 0022-2836(05)80360-2.2231712 · doi ↗ · pubmed ↗
- 4Hauser M, Steinegger M, Söding J M Mseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics. 2016; 32:1323–30.10.1093/bioinformatics/btw 006.26743509 · doi ↗ · pubmed ↗
- 5Steinegger M, Söding J M Mseqs 2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017; 35:1026–8.10.1038/nbt.3988.29035372 · doi ↗ · pubmed ↗
- 6Donato L, Scimone C, Rinaldi C et al. New evaluation methods of read mapping by 17 aligners on simulated and empirical NGS data: an updated comparison of DNA- and RNA-Seq data from Illumina and Ion Torrent technologies. Neural Comput Appl. 2021; 33:15669–92.10.1007/s 00521-021-06188-z.34155424 PMC 8208613 · doi ↗ · pubmed ↗
- 7Zhang X, Liu C-G, Yang S-H et al. Benchmarking of long-read sequencing, assemblers and polishers for yeast genome. Brief Bioinform. 2022; 23:bbac 14610.1093/bib/bbac 146.35511110 · doi ↗ · pubmed ↗
- 8Van Noorden R, Maher B, Nuzzo R The top 100 papers. Nature. 2014; 514:550–3.10.1038/514550 a.25355343 · doi ↗ · pubmed ↗
