The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-scale Experimental Analysis -- Version 3
Giuseppe Cattaneo, Umberto Ferraro Petrillo, Raffaele Giancarlo, Francesco Palini, Chiara Romualdi

TL;DR
This paper provides a comprehensive large-scale evaluation of alignment-free sequence comparison functions, focusing on their power and false positive control across diverse sequence lengths and genomic features, offering guidance for their application.
Contribution
It presents the first uniform assessment of AF functions' power and Type I error, identifying the most effective functions for various genomic analysis scenarios.
Findings
Four AF functions outperform others across different sequence lengths.
Most AF functions show consistent performance between short and long sequences.
The study offers a public platform for validating future AF functions.
Abstract
Motivation: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies of how well they control the false positive rate (Type I error). However, assessment of their power, i.e., their ability to identify true similarity, has been limited to some members of the D2 family, through experimental studies on short sequences that are not adequate for current applications, where sequence lengths may vary considerably. This state of the art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. Results: By concentrating on a representative set of word-frequency based AF functions, we perform the first coherent and uniform evaluation of their power, also covering Type I error for completeness. Two alternative models of important…
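To make "word-frequency based" concrete, here is a minimal Python sketch of the basic D2 statistic, the simplest member of the D2 family named above: it counts every overlapping word of length k in each sequence and sums the products of the counts of the words the two sequences share. The function names `kmer_counts` and `d2` are illustrative assumptions, not identifiers from the paper, and the sketch does not reproduce the authors' experimental pipeline or the other AF functions they evaluate.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word (k-mer) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """Basic D2 statistic: sum over words w of count_x(w) * count_y(w).

    Hypothetical helper for illustration only; larger values mean the
    two sequences share more words of length k.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Only words present in both sequences contribute a non-zero term.
    return sum(n * cy[w] for w, n in cx.items() if w in cy)


if __name__ == "__main__":
    print(d2("ACGT", "ACGT", 2))  # 3: the words AC, CG, GT each occur once in both
    print(d2("ACGT", "TTTT", 2))  # 0: no word of length 2 is shared
```

D2 is a similarity score rather than a distance, so higher values indicate more shared word content. The raw value grows with sequence length, which is one reason normalised variants of D2 exist.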
Taxonomy
Topics: Machine Learning in Bioinformatics · Fractal and DNA sequence analysis · Genomics and Phylogenetic Studies
