TL;DR
This survey examines the BWT variants that bioinformatics tools compute for string collections, showing how they differ in theory and in practice and how those differences affect the processing of biological data.
Contribution
The paper systematically reviews 18 tools, identifies six BWT variants, and compares their theoretical and practical differences across multiple biological datasets.
Findings
The BWT variants differ significantly from one another, especially on collections of short, similar sequences.
The number of BWT runs differs by a factor of up to 4.2 across variants.
With many tools, the output BWT depends on the order in which the input collection is given.
Abstract
In recent years, the focus of bioinformatics research has moved from individual sequences to collections of sequences. Given the fundamental role of the Burrows-Wheeler Transform (BWT) in string processing, a number of dedicated tools have been developed for computing the BWT of string collections. While the focus has been on improving efficiency, both in space and time, the exact definition of the BWT employed has not been at the center of attention. As we show in this paper, the different tools in use often compute non-equivalent BWT variants: the resulting transforms can differ from each other significantly, including the number of runs, a central parameter of the BWT. Moreover, with many tools, the transform depends on the input order of the collection. In other words, on the same dataset, the same tool may output different transforms if the dataset is given in a different…
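The input-order dependence described above can be illustrated with a toy construction. The following is a minimal sketch, not the method of any particular tool: it assumes a naive concatenation-based transform in which every string is terminated by `$` and the whole text by a smaller sentinel `#`. The function names `bwt_concat` and `count_runs` are hypothetical, and the quadratic suffix sort is only suitable for tiny inputs.

```python
def bwt_concat(strings):
    """Toy BWT of a collection (assumed construction, for illustration only).

    Each string is terminated by '$'; a final '#' (which sorts before '$'
    and before letters in ASCII) ends the concatenated text.
    """
    text = "".join(s + "$" for s in strings) + "#"
    # Because the text ends with a unique smallest sentinel, sorting
    # suffixes gives the same order as sorting rotations.
    suffix_order = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix
    # (index -1 wraps around to the final sentinel).
    return "".join(text[i - 1] for i in suffix_order)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    collection = ["AC", "AG"]
    for order in (collection, list(reversed(collection))):
        transform = bwt_concat(order)
        print(order, transform, count_runs(transform))
```

Running the script on the same two strings in both orders prints two different transforms, which is the effect the abstract describes: under this assumed construction, the output is a function of the input order and not only of the set of strings.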
