TL;DR
Apollo is a universal, scalable assembly polishing algorithm that effectively integrates reads from all sequencing technologies to improve genome assembly accuracy without size or technology limitations.
Contribution
It introduces a novel pHMM-based approach that unifies polishing across different sequencing technologies and scales to large genomes in a single run.
Findings
Apollo uses reads from all technologies in one run
It scales well to large genome assemblies
It outperforms existing algorithms in accuracy and scalability
Abstract
Long reads produced by third-generation sequencing technologies are used to construct an assembly (i.e., the subject's genome), which is further used in downstream genome analysis. Unfortunately, long reads have high sequencing error rates, and a large proportion of the base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly using information from alignments between reads and the assembly (i.e., read-to-assembly alignment information). However, existing assembly polishing algorithms can only polish an assembly using reads from a certain sequencing technology, or can only polish a small assembly. Such technology dependency and assembly-size dependency require researchers to 1) run multiple polishing algorithms and 2)…
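
To make the idea of polishing from read-to-assembly alignments concrete, here is a deliberately simplified sketch. It is not Apollo's pHMM-based method: it uses a plain per-position majority vote, assumes gap-free alignments given as hypothetical `(start, sequence)` pairs, and the function name `polish_by_majority` is invented for illustration.

```python
from collections import Counter


def polish_by_majority(assembly, aligned_reads):
    """Toy polisher: replace an assembly base when a strict majority of
    aligned read bases disagrees with it.

    aligned_reads is a list of (start, sequence) pairs. Gap-free alignments
    are a hypothetical simplification; real polishers also handle
    insertions and deletions.
    """
    # Collect the read bases observed at every assembly position.
    pileup = [Counter() for _ in assembly]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(assembly):
                pileup[pos][base] += 1

    polished = []
    for original, counts in zip(assembly, pileup):
        if not counts:
            polished.append(original)  # no read evidence: keep the assembly base
            continue
        best, best_count = counts.most_common(1)[0]
        # Overwrite only when more than half of the reads support the base.
        supported = best_count * 2 > sum(counts.values())
        polished.append(best if supported else original)
    return "".join(polished)


draft = "ACGTTCGA"
reads = [(0, "ACGAT"), (2, "GATCG"), (3, "ATCGA")]
print(polish_by_majority(draft, reads))
```

In this toy input, all three reads cover position 3 and agree on `A`, so the draft's `T` at that position is replaced and every other base is kept.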

| Dataset | First Run | Second Run | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|
| E. coli O157 | — | — | 99.94 | 0.9998 | 0.9992 | 43m 53s | 3.79 |
| E. coli O157 | Apollo (Hybrid) | — | 99.94 | 0.9999 | 0.9993 | 8h 16m 08s | 13.85 |
| E. coli O157 | Racon (PacBio) | Racon (Illumina) | 99.94 | 0.9994 | 0.9988 | 21m 44s | 22.65 |
| E. coli O157 | Racon (PacBio) | Racon (PacBio) | 99.94 | 0.9984 | 0.9978 | 4m 58s | 2.43 |
| E. coli O157 | Pilon (Illumina) | Pilon (Illumina) | 99.94 | 0.9999 | 0.9993 | 4m 10s | 11.40 |
| E. coli O157 | Pilon (Illumina) | Racon (PacBio) | 99.94 | 0.9986 | 0.9980 | 4m 58s | 11.40 |
| E. coli O157 | Quiver (PacBio) | Quiver (PacBio) | 99.94 | 0.9998 | 0.9992 | 13m 06s | 1.98 |
| E. coli O157 | Quiver (PacBio) | Pilon (Illumina) | 99.94 | 0.9998 | 0.9992 | 5m 01s | 7.50 |
| E. coli O157 | Quiver (PacBio) | Racon (PacBio) | 99.94 | 0.9986 | 0.9980 | 5m 13s | 2.48 |
| E. coli O157:H7 | — | — | 100 | 0.9998 | 0.9998 | 43m 19s | 3.39 |
| E. coli O157:H7 | Apollo (Hybrid) | — | 100 | 0.9999 | 0.9999 | 5h 58m 05s | 8.86 |
| E. coli O157:H7 | Racon (PacBio) | Racon (Illumina) | 100 | 0.9995 | 0.9995 | 9m 43s | 6.56 |
| E. coli O157:H7 | Racon (PacBio) | Racon (PacBio) | 100 | 0.9970 | 0.9970 | 5m 36s | 2.24 |
| E. coli O157:H7 | Pilon (Illumina) | Pilon (Illumina) | 100 | 0.9998 | 0.9998 | 35m 12s | 10.79 |
| E. coli O157:H7 | Pilon (Illumina) | Racon (PacBio) | 100 | 0.9996 | 0.9996 | 6m 04s | 10.75 |
| E. coli K-12 | — | — | 99.98 | 0.9794 | 0.9792 | 34h 21m 46s | 5.06 |
| E. coli K-12 | Apollo (Hybrid) | — | 99.99 | 0.9953 | 0.9952 | 9h 09m 50s | 9.35 |
| E. coli K-12 | Racon (ONT) | Racon (Illumina) | 100 | 0.9996 | 0.9996 | 11m 05s | 5.10 |
| E. coli K-12 | Racon (ONT) | Racon (ONT) | 100 | 0.9851 | 0.9851 | 14m 45s | 4.20 |
| E. coli K-12 | Pilon (Illumina) | Pilon (Illumina) | 99.99 | 0.9993 | 0.9992 | 18m 55s | 8.84 |
| E. coli K-12 | Pilon (Illumina) | Racon (ONT) | 99.99 | 0.9997 | 0.9996 | 15m 51s | 8.84 |
| E. coli K-12 | Nanopolish (ONT) | Nanopolish (ONT) | 99.98 | 0.9929 | 0.9927 | 25h 39m 17s | 4.84 |
| E. coli K-12 | Nanopolish (ONT) | Pilon (Illumina) | 99.99 | 0.9992 | 0.9991 | 9h 45m 01s | 18.10 |
| E. coli K-12 | Nanopolish (ONT) | Racon (ONT) | 100 | 0.9866 | 0.9866 | 9h 42m 24s | 4.54 |
| Yeast S288C | — | — | 99.89 | 0.9998 | 0.9987 | 1h 20m 39s | 6.24 |
| Yeast S288C | Apollo (Hybrid) | — | 99.89 | 0.9998 | 0.9987 | 11h 08m 41s | 6.38 |
| Yeast S288C | Racon (PacBio) | Racon (Illumina) | 99.89 | 0.9994 | 0.9983 | 38m 21s | 6.93 |
| Yeast S288C | Racon (PacBio) | Racon (PacBio) | 99.89 | 0.9949 | 0.9938 | 49m 52s | 6.93 |
| Yeast S288C | Pilon (Illumina) | Pilon (Illumina) | 99.89 | 0.9998 | 0.9987 | 1m 10s | 11.85 |
| Yeast S288C | Pilon (Illumina) | Racon (PacBio) | 99.89 | 0.9960 | 0.9949 | 21m 42s | 11.85 |
| Yeast S288C | Quiver (PacBio) | Quiver (PacBio) | 98.95 | 0.9998 | 0.9893 | 23m 23s | 2.96 |
| Yeast S288C | Quiver (PacBio) | Pilon (Illumina) | 98.95 | 0.9998 | 0.9893 | 12m 47s | 13.28 |
| Yeast S288C | Quiver (PacBio) | Racon (PacBio) | 98.93 | 0.9968 | 0.9861 | 40m 04s | 6.69 |

| Dataset | First Run | Second Run | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|
| E. coli O157 | — | — | 94.93 | 0.9000 | 0.8544 | 1m 48s | 10.03 |
| E. coli O157 | Apollo (Hybrid) | — | 98.70 | 0.9866 | 0.9738 | 3h 51m 51s | 12.08 |
| E. coli O157 | Racon (PacBio) | Racon (Illumina) | 99.37 | 0.9992 | 0.9929 | 21m 19s | 22.66 |
| E. coli O157 | Racon (PacBio) | Racon (PacBio) | 99.51 | 0.9980 | 0.9931 | 5m 00s | 2.46 |
| E. coli O157 | Pilon (Illumina) | Pilon (Illumina) | 96.88 | 0.9872 | 0.9564 | 34m 53s | 18.60 |
| E. coli O157 | Pilon (Illumina) | Racon (PacBio) | 98.87 | 0.9970 | 0.9857 | 35m 26s | 18.60 |
| E. coli O157 | Quiver (PacBio) | Quiver (PacBio) | 99.85 | 0.9994 | 0.9979 | 13m 45s | 5.05 |
| E. coli O157 | Quiver (PacBio) | Pilon (Illumina) | 99.80 | 0.9994 | 0.9974 | 9m 42s | 4.76 |
| E. coli O157 | Quiver (PacBio) | Racon (PacBio) | 99.81 | 0.9984 | 0.9965 | 10m 29s | 2.49 |
| E. coli O157:H7 | — | — | 88.56 | 0.8798 | 0.7792 | 2m 57s | 6.27 |
| E. coli O157:H7 | Apollo (Hybrid) | — | 97.53 | 0.9804 | 0.9562 | 2h 54m 55s | 8.34 |
| E. coli O157:H7 | Racon (PacBio) | Racon (Illumina) | 99.02 | 0.9991 | 0.9893 | 9m 24s | 6.56 |
| E. coli O157:H7 | Racon (PacBio) | Racon (PacBio) | 99.22 | 0.9954 | 0.9876 | 5m 31s | 2.24 |
| E. coli O157:H7 | Racon (PacBio) | Pilon (Illumina) | 99.12 | 0.9981 | 0.9893 | 20m 37s | 12.57 |
| E. coli O157:H7 | Pilon (Illumina) | Pilon (Illumina) | 96.32 | 0.9896 | 0.9532 | 35m 12s | 15.84 |
| E. coli K-12 | — | — | 86.68 | 0.8503 | 0.7370 | 4m 04s | 16.47 |
| E. coli K-12 | Apollo (Hybrid) | — | 97.53 | 0.9419 | 0.9186 | 2h 18m 33s | 9.12 |
| E. coli K-12 | Racon (ONT) | Racon (Illumina) | 99.51 | 0.9992 | 0.9943 | 8m 38s | 5.17 |
| E. coli K-12 | Racon (ONT) | Racon (ONT) | 99.78 | 0.9840 | 0.9818 | 11m 43s | 4.06 |
| E. coli K-12 | Pilon (Illumina) | Pilon (Illumina) | 89.61 | 0.9622 | 0.8622 | 32m 03s | 17.78 |
| E. coli K-12 | Pilon (Illumina) | Racon (ONT) | 99.43 | 0.9979 | 0.9922 | 25m 15s | 32.15 |
| E. coli K-12 | Nanopolish (ONT) | Nanopolish (ONT) | 97.35 | 0.9488 | 0.9236 | 241h 56m 10s | 8.49 |
| E. coli K-12 | Nanopolish (ONT) | Pilon (Illumina) | 96.48 | 0.9769 | 0.9425 | 117h 29m 47s | 32.15 |
| E. coli K-12 | Nanopolish (ONT) | Racon (ONT) | 99.62 | 0.9814 | 0.9776 | 117h 08m 16s | 8.49 |
| Yeast S288C | — | — | 95.05 | 0.8923 | 0.8481 | 2m 20s | 16.59 |
| Yeast S288C | Apollo (Hybrid) | — | 98.49 | 0.9709 | 0.9562 | 6h 37m 46s | 5.96 |
| Yeast S288C | Racon (PacBio) | Racon (Illumina) | 99.26 | 0.9986 | 0.9912 | 23m 51s | 6.75 |
| Yeast S288C | Racon (PacBio) | Racon (PacBio) | 99.33 | 0.9937 | 0.9879 | 43m 00s | 6.75 |
| Yeast S288C | Racon (PacBio) | Pilon (Illumina) | 99.23 | 0.9977 | 0.9900 | 22m 07s | 14.86 |
| Yeast S288C | Pilon (Illumina) | Pilon (Illumina) | 95.80 | 0.9595 | 0.9192 | 2m 35s | 15.31 |
| Yeast S288C | Quiver (PacBio) | Quiver (PacBio) | 99.42 | 0.9997 | 0.9939 | 24m 49s | 4.14 |
| Yeast S288C | Quiver (PacBio) | Pilon (Illumina) | 99.45 | 0.9996 | 0.9941 | 12m 23s | 13.40 |
| Yeast S288C | Quiver (PacBio) | Racon (PacBio) | 99.50 | 0.9965 | 0.9915 | 29m 31s | 6.39 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio | Miniasm | — | — | — | 94.93 | 0.9000 | 0.8544 | 1m 48s | 10.03 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 98.49 | 0.9798 | 0.9650 | 2h 27m 49s | 7.07 |
| PacBio | Miniasm | Minimap2 | PacBio | Pilon | 96.43 | 0.9528 | 0.9188 | 1h 31m 32s | 17.68 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 99.35 | 0.9951 | 0.9886 | 2m 13s | 2.44 |
| PacBio | Miniasm | pbalign | PacBio | Quiver | 99.80 | 0.9993 | 0.9973 | 7m 31s | 0.51 |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 97.61 | 0.9816 | 0.9581 | 4h 25m 17s | 9.22 |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 96.52 | 0.9775 | 0.9435 | 32m 48s | 18.60 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 96.45 | 0.9876 | 0.9525 | 14m 09s | 21.57 |
| PacBio | Miniasm | BWA-MEM | Illumina | Apollo | 96.62 | 0.9738 | 0.9409 | 3h 32m 45s | 9.21 |
| PacBio | Miniasm | BWA-MEM | Illumina | Pilon | 96.13 | 0.9693 | 0.9318 | 31m 21s | 18.45 |
| PacBio | Miniasm | BWA-MEM | Illumina | Racon | 96.90 | 0.9813 | 0.9509 | 12m 05s | 20.85 |
| PacBio | Canu | — | — | — | 99.94 | 0.9998 | 0.9992 | 43m 53s | 3.79 |
| PacBio | Canu | Minimap2 | PacBio | Apollo | 99.94 | 0.9997 | 0.9991 | 3h 42m 03s | 8.82 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 99.94 | 0.9986 | 0.9980 | 2m 17s | 2.34 |
| PacBio | Canu | pbalign | PacBio | Quiver | 99.94 | 0.9998 | 0.9992 | 7m 06s | 0.20 |
| PacBio | Canu | BWA-MEM | Illumina | Apollo | 99.94 | 0.9999 | 0.9993 | 4h 49m 15s | 11.05 |
| PacBio | Canu | BWA-MEM | Illumina | Pilon | 99.94 | 0.9998 | 0.9992 | 2m 05s | 11.40 |
| PacBio | Canu | BWA-MEM | Illumina | Racon | 99.94 | 0.9999 | 0.9993 | 14m 58s | 21.04 |
| PacBio (30) | Miniasm∗ | — | — | — | — | — | — | — | — |
| PacBio (30) | Canu | — | — | — | 99.98 | 0.9981 | 0.9979 | 21m 03s | 3.70 |
| PacBio (30) | Canu | Minimap2 | PacBio (30) | Apollo | 99.98 | 0.9982 | 0.9980 | 43m 32s | 8.00 |
| PacBio (30) | Canu | Minimap2 | PacBio (30) | Racon | 99.98 | 0.9980 | 0.9978 | 15s | 0.59 |
| PacBio (30) | Canu | Minimap2 | PacBio (30, Corr.) | Apollo | 99.97 | 0.9976 | 0.9973 | 46m 10s | 7.99 |
| PacBio (30) | Canu | Minimap2 | PacBio (30, Corr.) | Racon | 99.98 | 0.9983 | 0.9981 | 7s | 0.37 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Apollo | 99.98 | 0.9997 | 0.9995 | 4h 48m 31s | 10.35 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Pilon | 99.98 | 0.9998 | 0.9996 | 3m 03s | 8.52 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Racon | 99.98 | 0.9997 | 0.9995 | 14m 42s | 21.04 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 11-mer Sim. (%) | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|---|---|
| PacBio | Reference | — | — | — | 100 / 100 | 99.89 / 99.98 | 99.92 / 99.96 | 99.66 / 99.96 |
| PacBio | Miniasm | — | — | — | 90.67 / 83.48 | 14.31 / 13.53 | 5.61 / 5.21 | 1.12 / 1.04 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 96.19 / 94.94 | 76.20 / 74.70 | 66.76 / 64.01 | 54.77 / 52.38 |
| PacBio | Miniasm | Minimap2 | PacBio | Pilon | 93.63 / 89.91 | 46.18 / 44.24 | 31.07 / 28.92 | 14.57 / 13.70 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 99.47 / 98.70 | 94.89 / 94.11 | 91.11 / 89.05 | 85.22 / 84.67 |
| PacBio | Miniasm | pbalign | PacBio | Quiver | 100 / 99.61 | 99.81 / 99.06 | 99.65 / 98.41 | 99.16 / 98.31 |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 97.11 / 95.42 | 83.33 / 82.33 | 78.23 / 76.56 | 71.05 / 69.02 |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 96.52 / 93.93 | 83.74 / 80.15 | 82.25 / 77.44 | 79.02 / 74.49 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 97.31 / 96.42 | 90.35 / 90.02 | 88.61 / 87.88 | 87.98 / 87.34 |
| PacBio | Miniasm | BWA-MEM | Illumina | Apollo | 96.98 / 94.19 | 80.06 / 77.20 | 75.18 / 72.08 | 67.71 / 64.42 |
| PacBio | Miniasm | BWA-MEM | Illumina | Pilon | 96.32 / 93.20 | 79.65 / 75.30 | 76.75 / 72.32 | 72.92 / 67.16 |
| PacBio | Miniasm | BWA-MEM | Illumina | Racon | 96.91 / 95.10 | 85.89 / 85.27 | 84.00 / 83.88 | 82.36 / 81.06 |
| PacBio | Canu | — | — | — | 100 / 99.93 | 99.63 / 99.78 | 99.46 / 99.42 | 98.93 / 99.00 |
| PacBio | Canu | Minimap2 | PacBio | Apollo | 100 / 99.93 | 99.50 / 99.74 | 99.17 / 99.50 | 98.50 / 99.11 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 99.87 / 99.74 | 98.44 / 98.52 | 97.37 / 97.39 | 95.63 / 95.78 |
| PacBio | Canu | pbalign | PacBio | Quiver | 100 / 100 | 99.80 / 99.72 | 99.67 / 99.44 | 99.40 / 99.25 |
| PacBio | Canu | BWA-MEM | Illumina | Apollo | 100 / 100 | 99.83 / 99.91 | 99.73 / 99.77 | 99.59 / 99.61 |
| PacBio | Canu | BWA-MEM | Illumina | Pilon | 100 / 100 | 99.83 / 99.93 | 99.73 / 99.77 | 99.59 / 99.62 |
| PacBio | Canu | BWA-MEM | Illumina | Racon | 100 / 100 | 99.81 / 99.91 | 99.71 / 99.75 | 99.57 / 99.53 |
| PacBio (30) | Canu | — | — | — | 99.47 / 99.41 | 96.74 / 96.88 | 95.20 / 94.92 | 92.31 / 91.39 |
| PacBio (30) | Canu | Minimap2 | PacBio (30) | Apollo | 99.61 / 99.41 | 97.04 / 97.40 | 95.41 / 95.63 | 92.67 / 92.48 |
| PacBio (30) | Canu | Minimap2 | PacBio (30) | Racon | 99.80 / 99.61 | 97.00 / 97.34 | 95.12 / 95.16 | 92.63 / 92.98 |
| PacBio (30) | Canu | Minimap2 | PacBio (30, Corr.) | Apollo | 99.41 / 99.41 | 97.00 / 97.47 | 95.31 / 95.72 | 92.44 / 92.93 |
| PacBio (30) | Canu | Minimap2 | PacBio (30, Corr.) | Racon | 99.67 / 99.48 | 97.48 / 98.19 | 96.00 / 96.56 | 93.12 / 94.07 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Apollo | 100 / 99.93 | 99.83 / 99.54 | 99.69 / 99.52 | 99.55 / 99.23 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Pilon | 100 / 99.93 | 99.83 / 99.70 | 99.69 / 99.58 | 99.51 / 99.31 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Racon | 100 / 99.93 | 99.89 / 99.63 | 99.73 / 99.62 | 99.55 / 99.29 |
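
This section does not say how the k-mer similarity percentages are computed, or what the two values in each cell refer to. As one plausible reading only, the sketch below computes a Jaccard similarity between the k-mer sets of two sequences. The function names and toy sequences are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity_percent(seq_a, seq_b, k):
    """Jaccard similarity of two k-mer sets, as a percentage.

    Assumption: the tables' 'Sim. (%)' may be defined differently.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 100.0  # two empty k-mer sets are treated as identical
    return 100.0 * len(a & b) / len(a | b)


reference = "ACGTACGTTAGCCGATAGGCTA"
draft = reference[:10] + "T" + reference[11:]  # one substituted base

for k in (3, 5, 7):
    print(k, round(jaccard_similarity_percent(reference, draft, k), 1))
```

A single wrong base alters up to k overlapping k-mers, so under a definition like this one, longer k-mers penalize residual errors more heavily. That is likely one reason the unpolished Miniasm rows fall so sharply between the 11-mer and 51-mer columns.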

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage 10 (%) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio | Reference | — | — | — | 50.48 | 99.92 | 99.49 | 564 | 99.94 |
| PacBio | Miniasm | — | — | — | 49.88 | 92.08 | 87.50 | 434 | 87.90 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 50.28 | 98.74 | 97.43 | 531 | 96.19 |
| PacBio | Miniasm | Minimap2 | PacBio | Pilon | 50.14 | 99.17 | 97.20 | 526 | 93.78 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 50.52 | 99.63 | 99.03 | 542 | 98.35 |
| PacBio | Miniasm | pbalign | PacBio | Quiver | 50.56 | 99.83 | 99.40 | 545 | 98.56 |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 50.37 | 96.49 | 94.60 | 513 | 93.74 |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 50.36 | 95.58 | 92.04 | 499 | 89.57 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 50.45 | 96.48 | 94.73 | 514 | 94.11 |
| PacBio | Miniasm | BWA-MEM | Illumina | Apollo | 50.30 | 95.55 | 92.22 | 498 | 89.58 |
| PacBio | Miniasm | BWA-MEM | Illumina | Pilon | 50.30 | 94.48 | 89.64 | 478 | 86.54 |
| PacBio | Miniasm | BWA-MEM | Illumina | Racon | 50.37 | 94.63 | 90.69 | 508 | 90.76 |
| PacBio | Canu | — | — | — | 50.36 | 99.90 | 99.46 | 547 | 99.73 |
| PacBio | Canu | Minimap2 | PacBio | Apollo | 50.36 | 99.90 | 99.46 | 547 | 99.92 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 50.35 | 99.89 | 99.44 | 547 | 99.89 |
| PacBio | Canu | pbalign | PacBio | Quiver | 50.36 | 99.90 | 99.46 | 547 | 99.38 |
| PacBio | Canu | BWA-MEM | Illumina | Apollo | 50.36 | 99.90 | 99.46 | 547 | 99.73 |
| PacBio | Canu | BWA-MEM | Illumina | Pilon | 50.36 | 99.90 | 99.46 | 547 | 99.73 |
| PacBio | Canu | BWA-MEM | Illumina | Racon | 50.36 | 99.90 | 99.46 | 547 | 99.73 |
| PacBio (30) | Canu | — | — | — | 50.44 | 99.89 | 99.42 | 560 | 99.61 |
| PacBio (30) | Canu | Minimap2 | PacBio (30) | Apollo | 50.46 | 99.89 | 99.44 | 560 | 99.91 |
| PacBio (30) | Canu | Minimap2 | PacBio (30) | Racon | 50.44 | 99.89 | 99.43 | 560 | 99.94 |
| PacBio (30) | Canu | Minimap2 | PacBio (30, Corr.) | Apollo | 50.46 | 99.89 | 99.42 | 560 | 99.92 |
| PacBio (30) | Canu | Minimap2 | PacBio (30, Corr.) | Racon | 50.46 | 99.89 | 99.42 | 560 | 99.97 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Apollo | 50.47 | 99.89 | 99.44 | 560 | 99.70 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Pilon | 50.47 | 99.89 | 99.44 | 560 | 99.71 |
| PacBio (30) | Canu | BWA-MEM | Illumina | Racon | 50.47 | 99.89 | 99.43 | 560 | 99.69 |
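
For readers unfamiliar with the columns above, the sketch below shows how GC content and two coverage summaries could be derived from a sequence and a list of per-base read depths. It assumes that the "Coverage 10 (%)" column means the share of positions with depth of at least 10, which this section does not confirm. The function names are illustrative.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)


def coverage_summary(depths, threshold=10):
    """Return (average depth, percent of positions with depth >= threshold).

    The threshold interpretation is an assumption about the table column.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    average = sum(depths) / len(depths)
    at_least = 100.0 * sum(d >= threshold for d in depths) / len(depths)
    return average, at_least


print(gc_percent("GGCCAATT"))
print(coverage_summary([0, 10, 20, 30]))
```

In practice the per-base depths would come from a read-alignment tool rather than a hand-written list.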

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio | Miniasm | — | — | — | 88.56 | 0.8798 | 0.7792 | 2m 57s | 6.27 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 96.99 | 0.9636 | 0.9346 | 1h 10m 23s | 7.07 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 98.94 | 0.9899 | 0.9794 | 2m 24s | 2.14 |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 96.06 | 0.9781 | 0.9396 | 2h 17m 28s | 5.66 |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 95.09 | 0.9791 | 0.9310 | 28m 54s | 15.84 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 96.17 | 0.9883 | 0.9504 | 4m 39s | 6.29 |
| PacBio | Canu | — | — | — | 100 | 0.9998 | 0.9998 | 43m 19s | 3.39 |
| PacBio | Canu | Minimap2 | PacBio | Apollo | 100 | 0.9997 | 0.9997 | 2h 57m 18s | 7.58 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 100 | 0.9975 | 0.9975 | 2m 50s | 2.23 |
| PacBio | Canu | Minimap2 | Illumina | Apollo | 100 | 0.9997 | 0.9997 | 3h 10m 16s | 6.18 |
| PacBio | Canu | Minimap2 | Illumina | Pilon | 100 | 0.9999 | 0.9999 | 1m 27s | 10.75 |
| PacBio | Canu | Minimap2 | Illumina | Racon | 100 | 0.9996 | 0.9996 | 7m 14s | 6.53 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 11-mer Sim. (%) | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|---|---|
| E. coli O157:H7 | Reference | — | — | — | 99.93 / 100 | 99.78 / 99.94 | 99.73 / 99.96 | 99.70 / 99.92 |
| E. coli O157:H7 | Miniasm | — | — | — | 91.14 / 81.04 | 9.01 / 7.94 | 3.25 / 2.74 | 0.37 / 0.33 |
| E. coli O157:H7 | Miniasm | Minimap2 | PacBio | Apollo | 96.46 / 91.36 | 61.52 / 57.92 | 52.73 / 48.27 | 35.22 / 32.38 |
| E. coli O157:H7 | Miniasm | Minimap2 | PacBio | Racon | 98.10 / 96.95 | 88.45 / 85.70 | 84.37 / 80.22 | 74.61 / 70.87 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Apollo | 97.97 / 93.43 | 81.92 / 78.79 | 77.05 / 72.69 | 66.69 / 63.21 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Pilon | 97.64 / 92.25 | 85.57 / 79.87 | 84.74 / 78.02 | 80.92 / 75.85 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Racon | 98.36 / 94.57 | 91.28 / 89.04 | 90.77 / 87.49 | 88.78 / 87.23 |
| E. coli O157:H7 | Canu | — | — | — | 99.80 / 99.93 | 99.41 / 99.57 | 99.13 / 99.46 | 99.06 / 98.99 |
| E. coli O157:H7 | Canu | Minimap2 | PacBio | Apollo | 99.80 / 99.93 | 99.35 / 99.57 | 99.08 / 99.44 | 98.82 / 98.88 |
| E. coli O157:H7 | Canu | Minimap2 | PacBio | Racon | 99.54 / 99.61 | 96.81 / 96.67 | 95.27 / 95.22 | 91.84 / 91.72 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Apollo | 99.93 / 99.93 | 99.48 / 99.85 | 99.17 / 99.71 | 98.95 / 99.70 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Pilon | 99.87 / 100 | 99.78 / 99.91 | 99.73 / 99.88 | 99.70 / 99.79 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Racon | 99.80 / 100 | 99.31 / 99.83 | 99.00 / 99.88 | 98.63 / 99.47 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage 10 (%) |
|---|---|---|---|---|---|---|---|---|---|
| E. coli O157:H7 | Reference | — | — | — | 50.43 | 97.42 | 94.3 | 183 | 99.93 |
| E. coli O157:H7 | Miniasm | — | — | — | 49.61 | 80.51 | 68.24 | 108 | 76.01 |
| E. coli O157:H7 | Miniasm | Minimap2 | PacBio | Apollo | 50.09 | 95.0 | 88.69 | 163 | 91.74 |
| E. coli O157:H7 | Miniasm | Minimap2 | PacBio | Racon | 50.55 | 97.03 | 93.06 | 173 | 96.59 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Apollo | 50.39 | 93.6 | 87.69 | 162 | 90.65 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Pilon | 50.36 | 93.01 | 85.66 | 159 | 86.75 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Racon | 50.48 | 93.84 | 88.52 | 163 | 91.67 |
| E. coli O157:H7 | Canu | — | — | — | 50.43 | 97.42 | 94.32 | 182 | 99.71 |
| E. coli O157:H7 | Canu | Minimap2 | PacBio | Apollo | 50.44 | 97.42 | 94.32 | 182 | 99.87 |
| E. coli O157:H7 | Canu | Minimap2 | PacBio | Racon | 50.41 | 97.4 | 94.22 | 182 | 99.73 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Apollo | 50.45 | 97.42 | 94.31 | 182 | 99.95 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Pilon | 50.44 | 97.42 | 94.33 | 182 | 99.71 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Racon | 50.45 | 97.42 | 94.29 | 182 | 99.98 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| ONT | Miniasm | — | — | — | 86.68 | 0.8503 | 0.7370 | 4m 04s | 16.47 |
| ONT | Miniasm | Minimap2 | ONT | Apollo | 97.50 | 0.9209 | 0.8979 | 1h 40m 08s | 7.96 |
| ONT | Miniasm | Minimap2 | ONT | Nanopolish | 96.01 | 0.9182 | 0.8816 | 117h 02m 10s | 8.49 |
| ONT | Miniasm | Minimap2 | ONT | Racon | 99.41 | 0.9769 | 0.9711 | 4m 55s | 3.70 |
| ONT | Miniasm | Minimap2 | Illumina | Apollo | 89.41 | 0.9291 | 0.8307 | 54m 46s | 6.20 |
| ONT | Miniasm | Minimap2 | Illumina | Pilon | 89.22 | 0.9310 | 0.8306 | 17m 28s | 10.58 |
| ONT | Canu | — | — | — | 99.98 | 0.9794 | 0.9792 | 34h 21m 46s | 5.06 |
| ONT | Canu | Minimap2 | ONT | Apollo | 99.99 | 0.9803 | 0.9802 | 6h 08m 05s | 8.09 |
| ONT | Canu | Minimap2 | ONT | Nanopolish | 99.98 | 0.9925 | 0.9923 | 9h 35m 26s | 4.54 |
| ONT | Canu | Minimap2 | ONT | Racon | 100 | 0.9840 | 0.9840 | 7m 22s | 4.20 |
| ONT | Canu | Minimap2 | Illumina | Apollo | 99.96 | 0.9982 | 0.9978 | 2h 09m 47s | 6.43 |
| ONT | Canu | Minimap2 | Illumina | Pilon | 99.99 | 0.9987 | 0.9986 | 11m 59s | 8.84 |
| ONT (30) | Miniasm∗ | — | — | — | — | — | — | — | — |
| ONT (30) | Canu | — | — | — | 99.98 | 0.9744 | 0.9742 | 3h 17m 47s | 4.54 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Apollo | 99.98 | 0.9752 | 0.9750 | 40m 37s | 7.74 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Nanopolish | 99.99 | 0.9857 | 0.9856 | 4h 07m 06s | 2.15 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Racon | 100 | 0.9825 | 0.9825 | 20s | 0.59 |
| ONT (30) | Canu | Minimap2 | ONT (30, Corr.) | Apollo | 99.96 | 0.9755 | 0.9751 | 46m 40s | 7.75 |
| ONT (30) | Canu | Minimap2 | ONT (30, Corr.) | Racon | 100 | 0.9799 | 0.9799 | 9s | 0.42 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 11-mer Sim. (%) | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|---|---|
| ONT | Reference | — | — | — | 99.79 / 100 | 99.37 / 99.70 | 99.35 / 99.51 | 99.22 / 99.65 |
| ONT | Miniasm | — | — | — | 82.92 / 80.97 | 13.49 / 14.57 | 5.40 / 5.59 | 1.22 / 1.29 |
| ONT | Miniasm | Minimap2 | ONT | Apollo | 88.09 / 87.01 | 39.46 / 41.06 | 26.10 / 27.06 | 12.11 / 12.20 |
| ONT | Miniasm | Minimap2 | ONT | Nanopolish | 89.67 / 87.09 | 47.47 / 48.79 | 38.19 / 37.81 | 25.04 / 25.30 |
| ONT | Miniasm | Minimap2 | ONT | Racon | 93.25 / 95.02 | 75.24 / 74.16 | 63.69 / 63.36 | 48.72 / 47.87 |
| ONT | Miniasm | Minimap2 | Illumina | Apollo | 91.25 / 87.17 | 50.96 / 53.20 | 44.37 / 44.28 | 32.54 / 32.75 |
| ONT | Miniasm | Minimap2 | Illumina | Pilon | 89.60 / 86.45 | 56.27 / 58.38 | 51.30 / 52.20 | 44.28 / 45.29 |
| ONT | Canu | — | — | — | 92.08 / 95.91 | 76.08 / 76.15 | 66.05 / 66.09 | 49.94 / 49.87 |
| ONT | Canu | Minimap2 | ONT | Apollo | 92.15 / 95.91 | 76.93 / 77.04 | 67.52 / 67.37 | 51.35 / 50.94 |
| ONT | Canu | Minimap2 | ONT | Nanopolish | 97.04 / 98.60 | 90.74 / 91.32 | 86.95 / 86.20 | 79.33 / 78.49 |
| ONT | Canu | Minimap2 | ONT | Racon | 94.49 / 96.89 | 80.33 / 80.42 | 72.03 / 71.58 | 57.39 / 56.79 |
| ONT | Canu | Minimap2 | Illumina | Apollo | 99.24 / 99.65 | 97.72 / 97.88 | 97.35 / 96.94 | 96.26 / 95.82 |
| ONT | Canu | Minimap2 | Illumina | Pilon | 99.59 / 99.59 | 98.10 / 98.39 | 98.37 / 97.70 | 97.06 / 96.46 |
| ONT (30) | Canu | — | — | — | 90.36 / 94.87 | 71.60 / 72.15 | 59.89 / 59.83 | 41.94 / 42.42 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Apollo | 91.05 / 94.84 | 72.62 / 73.06 | 60.96 / 61.17 | 43.50 / 43.84 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Nanopolish | 95.94 / 96.80 | 83.00 / 82.30 | 75.37 / 73.76 | 61.85 / 60.96 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Racon | 93.73 / 96.46 | 79.13 / 78.91 | 68.55 / 68.62 | 53.17 / 53.00 |
| ONT (30) | Canu | Minimap2 | ONT (30, Corr.) | Apollo | 91.05 / 94.97 | 72.64 / 73.56 | 61.21 / 62.11 | 42.89 / 43.54 |
| ONT (30) | Canu | Minimap2 | ONT (30, Corr.) | Racon | 92.08 / 95.91 | 74.79 / 76.08 | 64.47 / 65.09 | 47.06 / 47.38 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage 10 (%) |
|---|---|---|---|---|---|---|---|---|---|
| ONT | Reference | — | — | — | 50.79 | 99.70 | 98.96 | 237 | 99.55 |
| ONT | Miniasm | — | — | — | 52.62 | 90.85 | 82.50 | 147 | 75.72 |
| ONT | Miniasm | Minimap2 | ONT | Apollo | 52.23 | 97.44 | 94.28 | 216 | 94.84 |
| ONT | Miniasm | Minimap2 | ONT | Nanopolish | 52.10 | 96.97 | 90.32 | 200 | 90.35 |
| ONT | Miniasm | Minimap2 | ONT | Racon | 51.12 | 99.09 | 97.71 | 234 | 98.51 |
| ONT | Miniasm | Minimap2 | Illumina | Apollo | 51.89 | 92.90 | 86.52 | 181 | 80.33 |
| ONT | Miniasm | Minimap2 | Illumina | Pilon | 52.11 | 92.59 | 85.77 | 175 | 78.64 |
| ONT | Canu | — | — | — | 51.05 | 99.61 | 98.71 | 233 | 98.75 |
| ONT | Canu | Minimap2 | ONT | Apollo | 50.90 | 99.67 | 98.57 | 234 | 98.31 |
| ONT | Canu | Minimap2 | ONT | Nanopolish | 51.04 | 99.66 | 98.83 | 234 | 98.77 |
| ONT | Canu | Minimap2 | ONT | Racon | 51.01 | 99.65 | 98.75 | 234 | 99.24 |
| ONT | Canu | Minimap2 | Illumina | Apollo | 50.81 | 99.68 | 98.80 | 235 | 98.58 |
| ONT | Canu | Minimap2 | Illumina | Pilon | 50.80 | 99.68 | 98.77 | 235 | 98.76 |
| ONT (30) | Canu | — | — | — | 51.11 | 99.60 | 98.57 | 234 | 99.04 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Apollo | 51.14 | 99.60 | 98.59 | 234 | 99.19 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Nanopolish | 51.12 | 99.65 | 98.72 | 235 | 98.92 |
| ONT (30) | Canu | Minimap2 | ONT (30) | Racon | 51.05 | 99.64 | 98.78 | 234 | 99.35 |
| ONT (30) | Canu | Minimap2 | ONT (30, Corr.) | Apollo | 51.14 | 99.60 | 98.65 | 234 | 99.28 |
| ONT (30) | Canu | Minimap2 | ONT (30, Corr.) | Racon | 51.08 | 99.63 | 98.80 | 234 | 99.40 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio | Miniasm | — | — | — | 95.05 | 0.8923 | 0.8481 | 2m 23s | 16.59 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 98.44 | 0.9706 | 0.9555 | 6h 53m 51s | 4.62 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 99.15 | 0.9895 | 0.9811 | 18m 55s | 6.63 |
| PacBio | Miniasm | Minimap2 | PacBio | Quiver | 99.44 | 0.9995 | 0.9939 | 16m 11s | 0.26 |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 97.26 | 0.9733 | 0.9466 | 2h 05m 58s | 2.83 |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 97.06 | 0.9761 | 0.9474 | 4m 00s | 26.64 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 97.27 | 0.9835 | 0.9567 | 5m 00s | 7.34 |
| PacBio | Canu | — | — | — | 99.89 | 0.9998 | 0.9987 | 1h 20m 39s | 6.24 |
| PacBio | Canu | Minimap2 | PacBio | Apollo | 98.95 | 0.9997 | 0.9892 | 10h 59m 10s | 5.05 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 98.93 | 0.9964 | 0.9857 | 19m 16s | 6.82 |
| PacBio | Canu | Minimap2 | PacBio | Quiver | 98.95 | 0.9998 | 0.9893 | 12m 02s | 0.29 |
| PacBio | Canu | Minimap2 | Illumina | Apollo | 98.95 | 0.9998 | 0.9893 | 1h 22m 24s | 3.25 |
| PacBio | Canu | Minimap2 | Illumina | Pilon | 98.95 | 0.9998 | 0.9893 | 43s | 13.83 |
| PacBio | Canu | Minimap2 | Illumina | Racon | 98.95 | 0.9998 | 0.9893 | 2m 55s | 5.15 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 11-mer Sim. (%) | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|---|---|
| Yeast S288C | Reference | — | — | — | 100 / 100 | 99.96 / 99.87 | 99.87 / 99.71 | 99.73 / 99.59 |
| Yeast S288C | Miniasm | — | — | — | 95.49 / 91.36 | 12.06 / 10.85 | 4.38 / 3.84 | 0.62 / 0.55 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Apollo | 98.79 / 96.71 | 65.93 / 62.88 | 53.80 / 50.13 | 35.83 / 33.02 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Racon | 99.39 / 98.63 | 88.15 / 86.21 | 82.35 / 79.89 | 72.60 / 69.48 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Quiver | 99.89 / 99.34 | 99.38 / 98.42 | 99.07 / 98.19 | 98.98 / 97.63 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Apollo | 98.35 / 96.65 | 77.96 / 74.13 | 69.85 / 66.35 | 59.06 / 55.89 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Pilon | 98.84 / 96.25 | 84.87 / 79.60 | 82.25 / 77.24 | 80.12 / 74.60 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Racon | 98.51 / 97.18 | 89.53 / 87.02 | 87.49 / 84.96 | 87.02 / 83.89 |
| Yeast S288C | Canu | — | — | — | 100 / 99.45 | 99.91 / 99.09 | 99.86 / 98.97 | 99.60 / 98.56 |
| Yeast S288C | Canu | Minimap2 | PacBio | Apollo | 99.94 / 99.45 | 99.87 / 99.11 | 99.74 / 98.95 | 99.46 / 98.58 |
| Yeast S288C | Canu | Minimap2 | PacBio | Racon | 99.94 / 99.40 | 96.37 / 94.96 | 94.20 / 92.48 | 89.17 / 87.70 |
| Yeast S288C | Canu | Minimap2 | PacBio | Quiver | 100 / 99.62 | 99.93 / 99.19 | 99.89 / 98.95 | 99.76 / 98.69 |
| Yeast S288C | Canu | Minimap2 | Illumina | Apollo | 100 / 99.45 | 99.92 / 99.10 | 99.88 / 98.93 | 99.68 / 98.58 |
| Yeast S288C | Canu | Minimap2 | Illumina | Pilon | 100 / 99.45 | 99.94 / 99.13 | 99.89 / 98.95 | 99.74 / 98.69 |
| Yeast S288C | Canu | Minimap2 | Illumina | Racon | 100 / 99.45 | 99.94 / 99.15 | 99.89 / 98.95 | 99.75 / 98.67 |

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage 10 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Yeast S288C | Reference | — | — | — | 38.30 | 99.94 | 99.71 | 73 | 99.95 |
| Yeast S288C | Miniasm | — | — | — | 38.42 | 93.88 | 83.94 | 57 | 82.63 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Apollo | 38.00 | 99.11 | 97.45 | 69 | 94.38 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Racon | 38.26 | 99.51 | 98.64 | 70 | 96.34 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Quiver | 38.39 | 99.61 | 99.29 | 71 | 98.04 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Apollo | 38.22 | 97.10 | 94.98 | 66 | 90.53 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Pilon | 38.41 | 96.86 | 88.65 | 66 | 87.78 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Racon | 38.42 | 97.03 | 95.33 | 66 | 91.35 |
| Yeast S288C | Canu | — | — | — | 38.17 | 99.94 | 99.73 | 71 | 98.81 |
| Yeast S288C | Canu | Minimap2 | PacBio | Apollo | 38.17 | 99.94 | 99.73 | 71 | 98.83 |
| Yeast S288C | Canu | Minimap2 | PacBio | Racon | 38.09 | 99.94 | 99.23 | 71 | 98.21 |
| Yeast S288C | Canu | Minimap2 | PacBio | Quiver | 38.17 | 99.94 | 99.74 | 71 | 98.74 |
| Yeast S288C | Canu | Minimap2 | Illumina | Apollo | 38.17 | 99.94 | 99.73 | 71 | 98.81 |
| Yeast S288C | Canu | Minimap2 | Illumina | Pilon | 38.17 | 99.94 | 99.74 | 71 | 98.81 |
| Yeast S288C | Canu | Minimap2 | Illumina | Racon | 38.17 | 99.94 | 99.73 | 71 | 98.81 |

| Dataset | Assembler | Aligner | Polishing Algorithm | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|
| Human HG002 | Reference | — | — | 98.05 / 87.02 | 96.98 / 84.73 | 93.56 / 80.14 |
| Human HG002 | Minimap2 | PacBio | Apollo | 93.74 / 82.62 | 91.05 / 79.18 | 85.26 / 73.11 |
| Human HG002 | Minimap2 | PacBio | Quiver∗ | 94.55 / 83.49 | 91.50 / 79.47 | 84.95 / 72.36 |
| Human HG002 | Minimap2 | PacBio | Racon∗ | 85.96 / 74.53 | 79.07 / 67.58 | 67.19 / 56.73 |
| Human HG002 | Minimap2 | Illumina | Apollo | 98.33 / 87.22 | 97.41 / 85.05 | 94.26 / 80.64 |
| Human HG002 | BWA-MEM | Illumina | Apollo | 98.32 / 87.17 | 97.39 / 84.98 | 94.23 / 80.57 |
| Human HG002 | BWA-MEM | Illumina | Pilon∗ | 98.19 / 87.14 | 97.23 / 84.95 | 93.99 / 80.49 |
| Human HG002 | Minimap2 | PacBio (9) | Apollo | 54.00 / 43.72 | 45.59 / 36.91 | 36.82 / 30.24 |
| Human HG002 | BWA-MEM | PacBio (9) | Apollo | 53.97 / 42.76 | 45.61 / 36.10 | 36.95 / 29.66 |
| Human HG002 | Minimap2 | PacBio (9) | Racon | 48.93 / 37.77 | 39.97 / 31.08 | 31.04 / 24.62 |
| Human HG002 | BWA-MEM | PacBio (9) | Racon | 46.83 / 34.91 | 37.69 / 28.35 | 28.67 / 22.07 |

| Dataset | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage 10 (%) |
|---|---|---|---|---|---|---|---|---|
| Human HG002 | — | — | — | 40.86 | 99.92 | 98.35 | 10 | 44.82 |
| Human HG002 | Minimap2 | PacBio | Apollo | 40.81 | 99.91 | 97.75 | 10 | 44.81 |
| Human HG002 | Minimap2 | PacBio | Quiver | 40.84 | 99.92 | 98.21 | 10 | 44.55 |
| Human HG002 | Minimap2 | PacBio | Racon∗ | 40.74 | 99.89 | 97.34 | 10 | 44.30 |
| Human HG002 | Minimap2 | Illumina | Apollo | 40.86 | 99.92 | 98.21 | 10 | 44.93 |
| Human HG002 | BWA-MEM | Illumina | Apollo | 40.86 | 99.92 | 98.19 | 10 | 44.90 |
| Human HG002 | BWA-MEM | Illumina | Pilon∗ | 40.86 | 99.92 | 98.22 | 10 | 44.86 |
| Human HG002 | Minimap2 | PacBio (9) | Apollo | 40.62 | 99.36 | 83.34 | 10 | 37.17 |
| Human HG002 | BWA-MEM | PacBio (9) | Apollo | 40.62 | 99.29 | 82.54 | 10 | 36.04 |
| Human HG002 | Minimap2 | PacBio (9) | Racon | 40.95 | 98.00 | 78.70 | 9 | 33.82 |
| Human HG002 | BWA-MEM | PacBio (9) | Racon | 40.94 | 97.27 | 76.30 | 9 | 32.07 |

| Dataset for the Assembly | Assembler | Aligner | Platform of the Aligned Reads | Number of Alignments | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|
| E. coli K-12 - ONT | Miniasm | Minimap2 | ONT | 8,095,856 | 3m 30s | 4.88 |
| E. coli K-12 - ONT | Canu | Minimap2 | ONT | 1,662,306 | 39s | 2.10 |
| E. coli K-12 - ONT (30) | Canu | Minimap2 | ONT (30) | 170,910 | 6s | 0.60 |
| E. coli O157 - PacBio | Miniasm | Minimap2 | PacBio | 732,397 | 25s | 1.79 |
| E. coli O157 - PacBio | Miniasm | Minimap2 | Illumina | 21,933,051 | 1m 35s | 3.16 |
| E. coli O157 - PacBio | Canu | Minimap2 | PacBio | 741,343 | 22s | 1.80 |
| E. coli O157 - PacBio (30) | Canu | Minimap2 | PacBio (30) | 148,241 | 5s | 0.67 |
| E. coli O157 - PacBio (30) | Canu | Minimap2 | PacBio (30, Corr) | 137,620 | 3s | 0.47 |
| E. coli O157 - PacBio | Miniasm | BWA-MEM | Illumina | 19,799,002 | 2m 34s | 3.17 |
| E. coli O157 - PacBio | Canu | BWA-MEM | Illumina | 23,328,379 | 1m 16s | 2.89 |
| E. coli O157 - PacBio (30) | Canu | BWA-MEM | Illumina | 23,326,202 | 1m 20s | 2.96 |
| E. coli O157 - PacBio | Miniasm | pbalign | PacBio | 49,561 | 12m 55s | 6.36 |
| E. coli O157 - PacBio | Canu | pbalign | PacBio | 51,994 | 11m 29s | 6.28 |

| Long Read Chunk Size | Contig Chunk Size | Aligned Bases | Aligned Bases (%) | Accuracy |
|---|---|---|---|---|
| 1000 | Original | 5,708,747 | 98.49 | 0.9798 |
| 1000 | 25000 | 5,487,736 | 94.46 | 0.9733 |
| 1000 | 50000 | 5,689,120 | 97.95 | 0.9728 |
| 1000 | 100000 | 5,493,663 | 94.52 | 0.9727 |
| 5000 | 25000 | 5,430,700 | 93.06 | 0.8974 |
| 5000 | 50000 | 5,411,163 | 92.68 | 0.8971 |
| 5000 | 100000 | 5,516,599 | 94.49 | 0.8970 |
| 10000 | 25000 | 5,415,333 | 92.65 | 0.8918 |
| 10000 | 50000 | 5,423,340 | 92.75 | 0.8914 |
| 10000 | 100000 | 5,474,159 | 93.61 | 0.8914 |

| Max Deletion (-d) | Filter Size (-f) | Aligned Bases | Aligned Bases (%) | Accuracy |
|---|---|---|---|---|
| 3 | 100 | 5,699,182 | 97.91 | 0.9739 |
| 5 | 100 | 5,696,138 | 97.93 | 0.9735 |
| 15 | 100 | 5,678,838 | 97.90 | 0.9731 |
| 3 | 200 | 5,705,130 | 98.12 | 0.9751 |
| 5 | 200 | 5,704,582 | 98.12 | 0.9750 |
| 15 | 200 | 5,702,478 | 98.14 | 0.9751 |

| Max Insertion (-i) | Filter Size (-f) | Aligned Bases | Aligned Bases (%) | Accuracy |
|---|---|---|---|---|
| 1 | 100 | 5,685,635 | 97.89 | 0.9660 |
| 5 | 100 | 5,638,585 | 97.62 | 0.9696 |
| 10 | 100 | 5,365,978 | 95.54 | 0.9531 |
| 1 | 200 | 5,685,040 | 98.02 | 0.9668 |
| 5 | 200 | 5,692,813 | 98.07 | 0.9740 |
| 10 | 200 | 5,623,736 | 97.62 | 0.9701 |

| Match Transition Probability (-tm) | Insertion Transition Probability (-ti) | Filter Size (-f) | Aligned Bases | Aligned Bases (%) | Accuracy |
|---|---|---|---|---|---|
| 0.60 | 0.25 | 100 | 5,670,852 | 97.95 | 0.9625 |
| 0.60 | 0.30 | 100 | 5,660,957 | 97.90 | 0.9596 |
| 0.80 | 0.10 | 100 | 5,699,660 | 98.02 | 0.9788 |
| 0.90 | 0.05 | 100 | 5,685,770 | 97.89 | 0.9774 |
| 0.60 | 0.25 | 200 | 5,682,512 | 98.10 | 0.9644 |
| 0.60 | 0.30 | 200 | 5,681,993 | 98.13 | 0.9618 |
| 0.80 | 0.10 | 200 | 5,707,293 | 98.16 | 0.9803 |
| 0.90 | 0.05 | 200 | 5,695,902 | 98.05 | 0.9789 |
| Aligner | Parameters |
|---|---|
| BWA-MEM | -t 45 |
| Minimap2 (for PacBio) | -x map-pb -a -t 45 |
| Minimap2 (for ONT) | -x map-ont -a -t 45 |
| Minimap2 (for Illumina) | -a -x sr -t 45 |
| pbalign | --nproc 45 |
Supplementary Material for
Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
Can Firtina, Jeremie S. Kim, Mohammed Alser, Damla Senol Cali, A. Ercument Cicek,
Can Alkan, and Onur Mutlu
1 Constructing a profile hidden Markov model graph
Apollo constructs a profile hidden Markov model graph (pHMM-graph) to represent the sequence of a contig as well as the errors that the contig may have. A pHMM-graph includes states and directed transitions from one state to another. The graph contains two types of probabilities: (1) emission and (2) transition probabilities. First, each state has emission probabilities for emitting certain characters, where each character is associated with a probability value in the range $[0, 1]$. Each emission probability reveals how likely it is to emit (e.g., consume or output) a certain character when a certain state is visited. Second, each transition is associated with a probability value in the range $[0, 1]$. A transition probability shows the probability of visiting a state from a certain state. Thus, one can calculate the likelihood of emitting all the characters in a given sequence by traversing a certain path in the graph.
The structure of the pHMM-graph allows us to handle insertion, deletion, and substitution errors by following certain states and transitions. We now explain the structure of the graph in detail. For an assembly contig $C$, let us define the pHMM-graph that represents the contig $C$ as $G(V, E)$. Let us also define the length of the contig $C$ as $n = |C|$. A base $C[t]$ has one of the letters in the alphabet set $\Sigma = \{A, C, G, T\}$. Thus, a state emits one of the characters in $\Sigma$ with a certain probability. For a state $i$, we denote the emission probability of a base $c \in \Sigma$ as $e_i(c) \in [0, 1]$ where $\sum_{c \in \Sigma} e_i(c) = 1$. We denote the transition probability from a state, $i$, to another state, $j$, as $\alpha_{ij} \in [0, 1]$. For the set of the states that the state $i$ has an outgoing transition to, $V_i$, we have $\sum_{j \in V_i} \alpha_{ij} = 1$. Now let us define in four steps how Apollo constructs the states and the transitions of the graph $G(V, E)$:
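First, Apollo constructs a start state, $v_{start} \in V$, and an end state, $v_{end} \in V$. Second, for each base $C[t]$ where $1 \le t \le n$, Apollo constructs a match state as follows (Figure S1):
- A match state that we denote as $M_t$ for the base $C[t]$, where $M_t \in V$ s.t. $1 \le t \le n$ and $C[t] \in \Sigma$ (i.e., if the $t$-th base of the contig is $C[t]$, then the corresponding match state is $M_t$). For the following steps, let us assume $i = M_t$.
- A match emission with the probability $\beta$, for the base $C[t]$ s.t. $e_i(C[t]) = \beta$. $\beta$ is a parameter to Apollo.
- A substitution emission with the probability $\delta$, for each base $c \in \Sigma$ and $c \ne C[t]$ s.t. $e_i(c) = \delta$ (Note that $\beta + 3\delta = 1$). $\delta$ is a parameter to Apollo.
- A match transition with the probability $\alpha_M$, from the match state $M_t$ to the next match state $M_{t+1}$ s.t. $\alpha_{M_t M_{t+1}} = \alpha_M$. $\alpha_M$ is a parameter to Apollo.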
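Third, for each base $C[t]$ of the contig, Apollo constructs the insertion states as follows (Figure S2):
- There are $l$ many insertion states, $I_t^1$, $I_t^2$, …, $I_t^l$, where $I_t^i \in V$, and $l$ is a parameter to Apollo.
- The match state, $M_t$, has an insertion transition to $I_t^1$, with the probability $\alpha_I$ s.t. $\alpha_{M_t I_t^1} = \alpha_I$.
- For each $i$ where $1 \le i < l$, the insertion state $I_t^i$ has an insertion transition to the next insertion state $I_t^{i+1}$ with the probability $\alpha_I$ s.t. $\alpha_{I_t^i I_t^{i+1}} = \alpha_I$.
- For each $i$ where $1 \le i < l$, the insertion state $I_t^i$ has a match transition to the match state of the next base, $M_{t+1}$, with the probability $\alpha_M$ s.t. $\alpha_{I_t^i M_{t+1}} = \alpha_M$.
- The last insertion state, $I_t^l$, has no further insertion transitions. Instead, it has a transition to the match state of the next base, $M_{t+1}$, with the probability $\alpha_M + \alpha_I$ s.t. $\alpha_{I_t^l M_{t+1}} = \alpha_M + \alpha_I$.
- For each $i$ where $1 \le i \le l$, each base $c \in \Sigma$ and $c \ne C[t+1]$ has an insertion emission probability of $1/3$ for the insertion state $I_t^i$ s.t. $e_{I_t^i}(c) = 1/3 \approx 0.33$ and $e_{I_t^i}(C[t+1]) = 0$. Note that $\sum_{c \in \Sigma} e_{I_t^i}(c) = 1$ (i.e., if the base at the location $t+1$ is T, then $e_{I_t^i}(A) = 0.33$, $e_{I_t^i}(T) = 0$, $e_{I_t^i}(G) = 0.33$, and $e_{I_t^i}(C) = 0.33$).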
Fourth, to finalize the complete structure of the pHMM-graph, for each state $i \in V$, Apollo constructs the deletion transitions as follows (Figure S3):
- Let us define $\alpha_{del} = 1 - \alpha_M - \alpha_I$, which is the overall deletion transition probability.
- There are $k$ many deletion transitions from the state $i$ to the further match states. $k$ is a parameter to Apollo.
- We assume that a transition deletes the bases if it skips the corresponding match states of the bases. We denote the transition probability of a deletion transition as $\alpha_D^x$ s.t. $1 \le x \le k$, if it deletes $x$ many bases in a row in one transition. Apollo calculates the deletion transition probability $\alpha_D^x$ using the normalized version of a polynomial distribution, where $f$ is a factor value for the equation:

$$\alpha_D^x = \frac{f^{\,k-x}\,\alpha_{del}}{\sum_{j=0}^{k-1} f^{\,j}}, \qquad 1 \le x \le k \tag{S1}$$

- If the $f$ value is set to $1$, then each deletion transition is equally likely (i.e., $\alpha_D^x = \alpha_D^{x+1}$, if $f = 1$). As the $f$ value increases, the probability of deleting more bases in one transition decreases accordingly (i.e., $\alpha_D^x > \alpha_D^{x+1}$, if $f > 1$). $f$ is a parameter to Apollo.
We note that the start state also has a match transition to $M_1$ and deletion transitions as defined previously. There are also $l$ many insertion states, $I_0^1$, $I_0^2$, …, $I_0^l$, between the start state and the first match state $M_1$. The transitions of these insertion states are identical to what we described before. We would also like to note that the end state has no outgoing transition. The prior states consider $v_{end}$ as a match state and connect to it accordingly. The start and end states have no emission probabilities.
Note that the design of the pHMM-graph described here and proposed in Hercules [1] is different from the conventional pHMM-graphs [2]. One significant difference is that the conventional pHMM-graphs have deletion states for each match state, whereas the pHMM-graph model of Apollo uses deletion transitions instead of states. In the conventional model, visiting deletion states does not consume (i.e., emit) a character from a given sequence (i.e., observation). Therefore, the conventional model requires storing extra "position" information that tells which character should be consumed given a state at iteration $t$ (i.e., in each transition from a state to another). We want to make sure that each state consumes exactly one character when visited so that we do not need to store the extra position information. In Apollo's design, the iteration number $t$ equals the position of the character that is being consumed, because Apollo's states consume exactly one character. This allows us to remove an entire dimension, the iteration number $t$, which greatly reduces both memory requirements and runtime while calculating the Forward-Backward values.
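The four construction steps of this section can be sketched in Python as follows. This is a minimal illustration, not Apollo's implementation: the function names, the tuple-based state labels, and the default parameter values are all hypothetical, and two simplifications are assumptions of the sketch (insertion states are created only where a next base exists, and transitions that would jump past the end state are dropped and the remaining outgoing probabilities renormalized).

```python
ALPHABET = "ACGT"

def deletion_probs(alpha_del, k, f):
    """Split the total deletion mass over k transitions; deleting x bases gets weight f**(k - x)."""
    weights = [f ** (k - x) for x in range(1, k + 1)]
    total = sum(weights)
    return [alpha_del * w / total for w in weights]

def build_phmm(contig, beta=0.97, alpha_m=0.85, alpha_i=0.10, l=3, k=5, f=2.5):
    """Return (emit, trans, start, end) for a contig. Defaults are illustrative only."""
    n = len(contig)
    start, end = ("M", 0), ("M", n + 1)        # start/end are treated as non-emitting match states
    delta = (1.0 - beta) / 3.0                 # substitution emission, so beta + 3 * delta = 1
    del_p = deletion_probs(1.0 - alpha_m - alpha_i, k, f)
    emit, trans = {}, {end: {}}                # the end state has no outgoing transition

    def outgoing(t, to_next_match, insertion_target):
        """Transitions of a state located right after contig position t."""
        out = {("M", t + 1): to_next_match}
        if insertion_target is not None:
            out[insertion_target] = alpha_i
        for x, p in enumerate(del_p, start=1):  # deleting x bases skips M_{t+1} .. M_{t+x}
            if t + 1 + x <= n + 1:
                out[("M", t + 1 + x)] = p
        total = sum(out.values())               # sketch assumption: renormalize near the contig end
        return {state: p / total for state, p in out.items()}

    for t in range(n + 1):
        has_insertions = t < n                  # sketch assumption: needs a next base C[t+1]
        if t >= 1:
            emit[("M", t)] = {c: (beta if c == contig[t - 1] else delta) for c in ALPHABET}
        trans[("M", t)] = outgoing(t, alpha_m, ("I", t, 1) if has_insertions else None)
        if not has_insertions:
            continue
        next_base = contig[t]                   # 0-based contig[t] is the base C[t+1]
        for i in range(1, l + 1):
            emit[("I", t, i)] = {c: (0.0 if c == next_base else 1.0 / 3.0) for c in ALPHABET}
            if i < l:
                trans[("I", t, i)] = outgoing(t, alpha_m, ("I", t, i + 1))
            else:                               # last insertion state: no further insertion
                trans[("I", t, i)] = outgoing(t, alpha_m + alpha_i, None)
    return emit, trans, start, end
```

Calling `build_phmm("ACGT")` yields six match-type states (including start and end) and twelve insertion states, and every state except the end state has outgoing probabilities that sum to one.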
2 The Forward-Backward and Baum-Welch Algorithms
Apollo uses the region of a pHMM-graph (i.e., sub-graph) that a read (i.e., observation or a sequence) is aligned to in order to calculate the likelihood of each state emitting a certain base at a given position in the aligned read. However, this does not mean that the position is known in advance, since an unknown number of insertion and deletion errors may have occurred by the time a given number of transitions has been followed from the start state to a certain state. Therefore, each state should measure the likelihood of emitting a character at a position that may differ from the number of transitions taken so far; in the error-free case, the two are equal. Apollo uses reads as observations for the Forward-Backward algorithm [3] in order to calculate the likelihoods per state. These likelihoods are calculated based on the initial transition and emission probabilities of a pHMM-graph and the read itself. Apollo uses these likelihoods to make the contig similar to the aligned read. Apollo then trains the pHMM-graph of a contig with each read that aligns to the contig using the Baum-Welch algorithm [3]. We describe the details of both the Forward-Backward and the Baum-Welch algorithms in the following paragraphs.
For each read aligning to a contig, Apollo uses the alignment location and the sequence of the read in order to train the pHMM-graph. First, for each aligned read sequence $r$, Apollo extracts the sub-graph $G_s(V_s, E_s)$ that corresponds to the aligned region of the contig, where we have $V_s \subseteq V$, $E_s \subseteq E$, match and insertion states, and the transitions as described in the Supplementary Section 1. Each transition from state $i$ to state $j$, $E_{ij} \in E_s$, is associated with a transition probability $\alpha_{ij}$. For every pair of states, $i \in V_s$ and $j \in V_s$, the transition probability $\alpha_{ij} = 0$ if $E_{ij} \notin E_s$. Let us define the length of the aligned read, $r$, as $m = |r|$. Second, Apollo calculates the forward and backward probabilities of each state based on the aligned read, $r$.
Let us assume that the forward probability of a state $j$ that observes the $t$-th base of the aligned read, $r[t]$, is $F_t(j)$. For the forward probability, observing the $t$-th base at the state $j$ means that all the previous bases ($r[1]$ through $r[t-1]$) have been observed by following a path starting from the start state to the state $j$, which then observes the next base, $r[t]$. All possible transitions that lead to state $j$ to observe the base $r[t]$ contribute to the probability with (1) the forward probability of the origin state $i$ calculated with the $(t-1)$-th base of $r$, $F_{t-1}(i)$, (2) multiplied by the probability of the transition from $i$ to $j$, $\alpha_{ij}$, (3) multiplied by the probability of emitting the base $r[t]$ at state $j$, $e_j(r[t])$.
Let us denote the start state with the index value of $0$ (i.e., $v_0 = v_{start}$). For each state $j \in V_s$, we calculate the forward probability, $F_t(j)$, as follows, where $F_1(j)$ is the initialization step:

$$F_1(j) = \alpha_{0j}\, e_j(r[1]) \qquad \text{s.t. } j \in V_s,\ E_{0j} \in E_s \tag{S2.1}$$

$$F_t(j) = \sum_{i \in V_s} F_{t-1}(i)\, \alpha_{ij}\, e_j(r[t]) \qquad \text{s.t. } j \in V_s,\ 1 < t \le m \tag{S2.2}$$
Let us assume that the backward probability of a state $i$ that observes the $t$-th base of the aligned read, $r[t]$, is $B_t(i)$. For the backward probability, observing the $t$-th base at the state $i$ means that all the further bases ($r[t+1]$ through $r[m]$) have been observed by following a path starting from the end state to the state $i$ (backwards), which then observes the previous base, $r[t]$. All possible transitions that lead to state $i$ to observe the base $r[t]$ contribute to the probability with (1) the backward probability of the next state $j$ calculated with the $(t+1)$-th base of $r$, $B_{t+1}(j)$, (2) multiplied by the probability of the transition from $i$ to $j$, $\alpha_{ij}$, (3) multiplied by the probability of emitting the base $r[t+1]$ at state $j$, $e_j(r[t+1])$.
Let us denote the end state as $v_{end}$. For each state $i \in V_s$, we calculate the backward probability, $B_t(i)$, as follows, where $B_m(i)$ is the initialization step:

$$B_m(i) = \alpha_{i\,v_{end}} \qquad \text{s.t. } i \in V_s,\ E_{i\,v_{end}} \in E_s \tag{S3.1}$$

$$B_t(i) = \sum_{j \in V_s} \alpha_{ij}\, e_j(r[t+1])\, B_{t+1}(j) \qquad \text{s.t. } i \in V_s,\ 1 \le t < m \tag{S3.2}$$
The calculation of the forward and backward probabilities is referred to as the Forward-Backward algorithm. After calculating the forward and backward probabilities, Apollo uses the Baum-Welch algorithm to train the pHMM-graph by calculating the posterior emission and transition probabilities of the sub-graph, $G_s$, as shown in equations S4 and S5, respectively. In equation S4, we use the Iversonian brackets [4] to denote that $[r[t] = X]$ is $1$ if the $t$-th character of $r$ is the same character as $X$. Otherwise, $[r[t] = X]$ is $0$. This structure lets us perform the summation in the numerator only when the character at position $t$ equals the character given to the function (i.e., $X$). We then normalize this summation to make sure that the emission probabilities of state $i$ sum to 1.

$$e^*_i(X) = \frac{\sum_{t=1}^{m} F_t(i)\, B_t(i)\, [r[t] = X]}{\sum_{t=1}^{m} F_t(i)\, B_t(i)} \qquad \forall X \in \Sigma,\ \forall i \in V_s \tag{S4}$$

$$\alpha^*_{ij} = \frac{\sum_{t=1}^{m-1} \alpha_{ij}\, e_j(r[t+1])\, F_t(i)\, B_{t+1}(j)}{\sum_{t=1}^{m-1} \sum_{x \in V_s} \alpha_{ix}\, e_x(r[t+1])\, F_t(i)\, B_{t+1}(x)} \qquad \forall E_{ij} \in E_s \tag{S5}$$
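A minimal Python sketch of the Forward-Backward pass and the posterior emission update described in this section is shown below. It is an illustration under stated assumptions rather than Apollo's code: the dictionary layout (`trans[i][j]`, `emit[j][c]`), the `"START"`/`"END"` labels, the function names, and the three-state toy model are all hypothetical, and read positions are 0-based. Plain probabilities are used for readability; long reads would need scaling or log-space arithmetic to avoid underflow.

```python
def forward_backward(read, states, trans, emit, start="START", end="END"):
    """Forward (F) and backward (B) tables; `states` lists the emitting states only."""
    m = len(read)
    F = [dict.fromkeys(states, 0.0) for _ in range(m)]
    B = [dict.fromkeys(states, 0.0) for _ in range(m)]
    for j in states:                                   # initialization steps
        F[0][j] = trans[start].get(j, 0.0) * emit[j].get(read[0], 0.0)
        B[m - 1][j] = trans[j].get(end, 0.0)
    for t in range(1, m):                              # forward recursion
        for i in states:
            for j, a in trans[i].items():
                if j != end:
                    F[t][j] += F[t - 1][i] * a * emit[j].get(read[t], 0.0)
    for t in range(m - 2, -1, -1):                     # backward recursion
        for i in states:
            B[t][i] = sum(a * emit[j].get(read[t + 1], 0.0) * B[t + 1][j]
                          for j, a in trans[i].items() if j != end)
    return F, B

def posterior_emissions(read, states, F, B, alphabet="ACGT"):
    """Posterior emission probabilities; states the read never reaches are omitted."""
    post = {}
    for i in states:
        denom = sum(F[t][i] * B[t][i] for t in range(len(read)))
        if denom > 0.0:
            post[i] = {c: sum(F[t][i] * B[t][i]
                              for t in range(len(read)) if read[t] == c) / denom
                       for c in alphabet}
    return post

# Toy sub-graph: match state M1 (base A), one insertion state I1, match state M2 (base C).
states = ["M1", "I1", "M2"]
trans = {"START": {"M1": 1.0},
         "M1": {"M2": 0.8, "I1": 0.2},
         "I1": {"M2": 1.0},
         "M2": {"END": 1.0}}
emit = {"M1": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
        "I1": {"A": 1 / 3, "C": 0.0, "G": 1 / 3, "T": 1 / 3},
        "M2": {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01}}
read = "AGC"                                           # carries an inserted G
F, B = forward_backward(read, states, trans, emit)
post = posterior_emissions(read, states, F, B)
```

In this toy model the only path that emits three characters passes through `I1`, so the posterior emission of `I1` concentrates entirely on `G`, which is how a read pulls the graph toward its own sequence.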
3 Joining Posterior Probabilities
As we explain in the Supplementary Section 2, for each read that aligns to the contig, Apollo extracts a sub-graph and uses the Forward-Backward algorithm to train the sub-graph. With high-coverage reads, it is highly likely that two or more sub-graphs overlap, such that the sub-graphs include the same states and transitions. However, the updates on the overlapping states and transitions are exclusive between the sub-graphs, such that no two updates in separate sub-graphs affect each other while calculating the forward or the backward probabilities. Each sub-graph uses the initial probabilities to calculate the posterior probabilities. In order to handle training of the overlapping states and transitions, Apollo takes the average of the posterior probabilities and reports the average probability as the final posterior probability for the entire pHMM-graph.
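Let us assume that the set of sub-graphs $\mathcal{G}_i$ includes the same state $i$. For each $G_s$ in $\mathcal{G}_i$, we obtain an $e^*_i(X)$, where $X \in \Sigma$, which denotes the posterior emission probability as we explain in the Supplementary Section 2. We denote the $e^*_i(X)$ that belongs to $G_s$ as $e^*_{i,s}(X)$. Then, Apollo finds the final emission value $\hat{e}_i(X)$ as follows:

$$\hat{e}_i(X) = \frac{1}{|\mathcal{G}_i|} \sum_{G_s \in \mathcal{G}_i} e^*_{i,s}(X) \qquad \forall X \in \Sigma \tag{S6}$$

Similarly, let us assume that the set of sub-graphs $\mathcal{G}_{ij}$ includes the same transition edge $E_{ij}$. For each $G_s$ in $\mathcal{G}_{ij}$, we obtain an $\alpha^*_{ij}$ that denotes the posterior transition value. We denote the $\alpha^*_{ij}$ that belongs to $G_s$ as $\alpha^*_{ij,s}$. Apollo finds the final transition value $\hat{\alpha}_{ij}$ as follows:

$$\hat{\alpha}_{ij} = \frac{1}{|\mathcal{G}_{ij}|} \sum_{G_s \in \mathcal{G}_{ij}} \alpha^*_{ij,s} \tag{S7}$$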
If a state in $V$ or an edge in $E$ is not covered by any read, then Apollo retains the initial emission and transition probabilities, respectively, and uses them as the posterior probabilities.
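The averaging rule of this section, including the fallback to the initial probabilities for uncovered states and edges, can be sketched as below. The function name and the key layout are hypothetical: a key may be a `(state, base)` pair for emissions or a `(from_state, to_state)` pair for transitions, and each per-read dictionary is assumed to contain only the keys that the read's sub-graph covers.

```python
def join_posteriors(initial, per_read_posteriors):
    """Average per-read posterior probabilities key by key.

    initial: {key: probability} for the whole graph.
    per_read_posteriors: list of {key: probability}, one dictionary per aligned read.
    Keys that no read covers keep their initial probability.
    """
    sums, counts = {}, {}
    for posterior in per_read_posteriors:
        for key, p in posterior.items():
            sums[key] = sums.get(key, 0.0) + p
            counts[key] = counts.get(key, 0) + 1
    return {key: (sums[key] / counts[key] if key in counts else p0)
            for key, p0 in initial.items()}

# Two reads overlap on M2; M3 is not covered by any read.
initial = {("M1", "A"): 0.97, ("M2", "C"): 0.97, ("M3", "G"): 0.97}
reads = [{("M1", "A"): 1.0, ("M2", "C"): 0.5},
         {("M2", "C"): 1.0}]
joined = join_posteriors(initial, reads)
```

Because every read starts from the same initial probabilities and the per-read results are simply averaged, the outcome does not depend on the order in which the reads are processed.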
We would like to note that the Baum-Welch algorithm is also used to train conventional hidden Markov models (HMMs). With each observation, the Baum-Welch algorithm updates the transition and emission probabilities of an HMM accordingly. The initial probabilities of such HMMs may even be assigned randomly. This means that the order of the observations (i.e., training data) and the initial probabilities used to train an HMM also affect the overall accuracy, as later observations usually use the HMM that was trained on earlier observations. Therefore, after using all the training data, an HMM may still have room to converge to a local optimum due to the biases caused by the initial probabilities and the order of the training data. The usual approach to mitigate such biases is to train HMMs multiple times until the overall accuracy of an HMM converges to a certain point. We do not follow this strategy for three reasons. First, Apollo does not set the initial transition and emission probabilities randomly. Instead, the probabilities are usually set according to the error profile of an assembly. Second, we use the initial probabilities each time a read is used to train the pHMM-graph, so the order of the training data does not matter. Third, Apollo is already time-consuming, and running multiple iterations until convergence would significantly increase the overall runtime, which we want to avoid.
4 Decoding with the Viterbi Algorithm
Apollo uses the Viterbi algorithm [5] to reveal the polished assembly by finding the most likely path starting from the start state, $v_{start}$, of the trained graph to the end state, $v_{end}$. For each state $j$, the Viterbi algorithm calculates $v_t(j)$, which is the maximum marginal forward probability obtained from following a path starting from the start state when decoding the $t$-th base of the polished contig. Let $c^*$ be the base that has the greatest emission probability for the state $j$ in the trained graph, i.e., $c^* = \arg\max_{c} \hat{e}_j(c)$, $c \in \Sigma$. Then, the value of $v_t(j)$ depends on 1) the transition probability from state $i$ to the state $j$, $\hat{\alpha}_{ij}$, 2) the Viterbi value of the state $i$ when decoding the $(t-1)$-th base of the polished contig, $v_{t-1}(i)$, and 3) the emission probability of the base $c^*$, $\hat{e}_j(c^*)$. The Viterbi algorithm also keeps a back pointer, $b_t(j)$, which keeps track of the predecessor state $i$ that yields the $v_t(j)$ value.
Let $T$ be the length of the decoded sequence, which is initially unknown. The algorithm recursively calculates $v_t(j)$ values for each position $t$ of a decoded sequence as described in the equations S8.1 and S8.3. The algorithm stops at iteration $T^*$ such that, for the last $\mathit{iter}$ iterations, the maximum value we have observed for $v_t(v_{end})$ cannot be improved; $\mathit{iter}$ is set to 50 by default (empirically chosen). $T$ is then set to $t^*$ such that $v_{t^*}(v_{end})$ is the maximum among all iterations $1 \le t \le T^*$.
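In the equations below, $c^*$ is the base with the greatest emission probability at state $j$ of the trained graph.

1. Initialization

$$v_1(j) = \hat{\alpha}_{v_{start}\,j}\; \hat{e}_j(c^*) \qquad \forall j \in V \tag{S8.1}$$

$$b_1(j) = v_{start} \qquad \forall j \in V \tag{S8.2}$$

2. Recursion

$$v_t(j) = \max_{i \in V}\; v_{t-1}(i)\; \hat{\alpha}_{ij}\; \hat{e}_j(c^*) \qquad \forall j \in V,\ 1 < t \le T \tag{S8.3}$$

$$b_t(j) = \arg\max_{i \in V}\; v_{t-1}(i)\; \hat{\alpha}_{ij}\; \hat{e}_j(c^*) \qquad \forall j \in V,\ 1 < t \le T \tag{S8.4}$$

3. Termination

$$v_T(v_{end}) = \max_{i \in V}\; v_T(i)\; \hat{\alpha}_{i\,v_{end}} \tag{S8.5}$$

$$b_T(v_{end}) = \arg\max_{i \in V}\; v_T(i)\; \hat{\alpha}_{i\,v_{end}} \tag{S8.6}$$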
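The polished contig is generated by recursively following the back pointers from the end state, $v_{end}$, at time $T$ until the back pointer points back to the start state, $v_{start}$, at time $1$. Writing $q_t$ for the state visited at time $t$, the path is recovered as follows:

$$q_T = b_T(v_{end}), \qquad q_{t-1} = b_t(q_t) \ \text{ for } 1 < t \le T, \qquad b_1(q_1) = v_{start}$$

The polished contig is then the sequence of the most likely bases $c^*$ of the states $q_1, q_2, \dots, q_T$.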
5 Performance of the Assembly Polishing Algorithms
In Tables S1, S2, S3, S6, S9, and S12, we compare the assembly polishing performance of Apollo to the competing algorithms based on the difference between the assemblies and their reference genomes (i.e., ground truth). In Tables S4, S7, S10, S13, and S15, we show the k-mer similarities between Illumina reads and the assemblies to provide an alignment-free comparison between the tools. We also use QUAST [6] to make a more detailed quality assessment of the assemblies in Tables S5, S8, S11, S14, and S16.
6 Performance of the Aligners
Here in Table S17, we show the performances of the aligners in terms of number of alignments that the aligners generate given the assembly and the reads to align, runtime (wall clock), and the memory requirement.
7 Robustness of Apollo
In Tables S18, S19, S20, and S21, we show the robustness of Apollo with respect to the parameters that have a direct effect on the machine learning algorithm. Each table shows that Apollo is robust to different sets of parameters.
8 Parameters
We show the parameter settings of the aligners that we used to align the reads to the assembly in Table S22.
References
- [1] Can Firtina, Ziv Bar-Joseph, Can Alkan, and A Ercument Cicek. Hercules: a profile HMM-based hybrid error correction algorithm for long reads. Nucleic Acids Research, 46(21):e125–e125, August 2018.
- [2] Sean R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, October 1998.
- [3] L. E. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3:1–8, 1972.
- [4] Donald E. Knuth. Two Notes on Notation. The American Mathematical Monthly, 99(5):403, May 1992.
- [5] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, April 1967.
- [6] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8):1072–1075, April 2013.
- [7] Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5):722–736, May 2017.
- [8] Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, July 2016.
