A mapping-free NLP-based technique for sequence search in Nanopore long-reads
Tomasz Strzoda, Lourdes Cruz-Garcia, Mustafa Najim, Christophe Badie, Joanna Polanska

TL;DR
This paper introduces an NLP-based sequence identification method for Nanopore long-reads that can replace classical mapping techniques, offering high accuracy and efficiency suitable for emergency scenarios.
Contribution
The study presents a novel NLP approach for sequence search in Nanopore reads, demonstrating high accuracy and robustness, and showing potential for rapid, mapping-free gene identification in critical situations.
Findings
Achieved 98.29% balanced accuracy (BACC) and 99.25% negative predictive value (NPV) for FDXR gene identification.
Validated the NLP model on an external dataset, reaching 99.64% NPV with the complete dictionary.
A reduced dictionary maintained high accuracy, with 96.49% BACC and 98.15% NPV.
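For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how balanced accuracy and NPV are computed from a binary confusion matrix. The function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def negative_predictive_value(tn, fn):
    """Fraction of reads predicted negative that are truly negative."""
    return tn / (tn + fn)


# Hypothetical counts, chosen only to illustrate the arithmetic (not the paper's data).
tp, fn, tn, fp = 90, 10, 980, 20

bacc = balanced_accuracy(tp, fp, tn, fn)
npv = negative_predictive_value(tn, fn)
print(f"BACC = {bacc:.4f}, NPV = {npv:.4f}")
```

A high NPV means that a read the classifier rejects is very unlikely to come from the target gene, which is why it matters when reads are discarded without mapping.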
Abstract
In unforeseen situations, such as nuclear power plant or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing rough dose estimates for thousands of donors. A new-generation in-situ mapper that is fast, consumes little energy, and works at the level of single nanopore output is in demand. We aim to create a sequence identification tool that utilizes Natural Language Processing (NLP) techniques and ensures a high negative predictive value (NPV) compared to the classical approach. The training dataset consisted of RNA-Seq data from 6 samples. After testing multiple NLP models, we found that the best configuration analyses the entire sequence and uses a word length of 3 base pairs with one neighboring word on each side. For the considered FDXR gene, the achieved mean balanced…
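The "word length of 3 base pairs with one-word neighbor on each side" configuration can be pictured with a short sketch. It splits a read into 3-base words and pairs each word with its immediate left and right neighbors, in the style of a skip-gram context window. The function names, the stride parameter, and the skip-gram pairing are assumptions made for illustration; the summary does not state whether the authors use overlapping or non-overlapping words, or which embedding model they train.

```python
def kmer_words(seq, k=3, stride=1):
    """Split a nucleotide sequence into words of k bases.

    stride=1 gives overlapping words; stride=k gives non-overlapping words.
    Which of the two the paper uses is not stated here, so both are supported.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def context_pairs(words, window=1):
    """Pair every word with its neighbors up to `window` positions away on each side."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


# Toy read, not real FDXR sequence.
words = kmer_words("ACGTAC", k=3, stride=1)
pairs = context_pairs(words, window=1)
print(words)
print(pairs)
```

With `window=1`, each interior word contributes two (word, neighbor) pairs and each end word contributes one, which matches the one-neighbor-per-side setting described in the abstract.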
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
Methods: Balanced Selection
