A new distance based on minimal absent words and applications to biological sequences
Giuseppa Castiglione, Jia Gao, Sabrina Mantaci, Antonio Restivo

TL;DR
This paper introduces a novel distance measure based on minimal absent words for comparing biological sequences, demonstrating its effectiveness through experiments on genetic data from 11 species.
Contribution
It proposes a new distance metric based on a specific subset of minimal absent words, improving sequence comparison methods.
Findings
According to the authors, the new distance captures the distinguishing features of the compared sequences better than the distance of [2].
Experimental results show better differentiation among species.
The method outperforms existing minimal absent word-based distances.
Abstract
A minimal absent word of a sequence x is a sequence y that is not a factor of x, but all of whose proper factors are factors of x. The set of minimal absent words uniquely defines the sequence itself. Recently, minimal absent words have been used to compare sequences: one compares the sets of their minimal absent words. Chairungsee and Crochemore in [2] define a distance between pairs of sequences x and y that involves the symmetric difference of the sets of minimal absent words of x and y. Here, we consider a different distance, introduced in [1], based on a specific subset of this symmetric difference that, in our opinion, better captures the different features of the considered sequences. We show the results of some experiments where the distance is tested on a dataset of genetic sequences from 11 living species, in order to compare the new…
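To make the definition concrete, here is a minimal brute-force sketch in Python that enumerates the minimal absent words of a short sequence and takes the symmetric difference of two such sets. The function names `factors` and `minimal_absent_words` and the `alphabet` parameter are illustrative assumptions, not taken from the paper. The sketch does not implement the distance of [2] or the new distance from [1]; it only builds the sets those distances are defined on.

```python
def factors(x):
    """Return the set of all factors (substrings) of x, including the empty word."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}


def minimal_absent_words(x, alphabet):
    """Return the minimal absent words of x over the given alphabet (naive enumeration).

    A word w is a minimal absent word of x when w is not a factor of x
    but every proper factor of w is. For |w| >= 2 this is equivalent to:
    w[:-1] and w[1:] are factors of x while w itself is not.
    """
    facts = factors(x)
    maws = set()
    # Length-1 case: a letter that never occurs in x (its only proper factor is the empty word).
    for a in alphabet:
        if a not in facts:
            maws.add(a)
    # Length >= 2 case: extend each non-empty factor by one letter on the right.
    for w in facts:
        if not w:
            continue
        for b in alphabet:
            cand = w + b
            if cand not in facts and cand[1:] in facts:
                maws.add(cand)
    return maws


if __name__ == "__main__":
    mx = minimal_absent_words("abaab", "ab")
    my = minimal_absent_words("ab", "ab")
    # The symmetric difference is the set that the distances discussed above start from.
    print(sorted(mx), sorted(my), sorted(mx ^ my))
```

This enumeration is quadratic in memory and is meant only for short toy sequences; genome-scale inputs need the linear-time constructions from the literature.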
Taxonomy
Topics: Machine Learning in Bioinformatics · Algorithms and Data Compression · RNA and protein synthesis mechanisms
