Comparison of machine learning and deep learning techniques in promoter prediction across diverse species
Nikita Bhandari, Satyajeet Khare, Rahee Walambe, Ketan Kotecha

TL;DR
This study compares machine learning and deep learning methods for promoter prediction across yeast, plant, and human genomes, highlighting the effectiveness of CNN and frequency-based tokenization for efficient classification.
Contribution
It introduces a novel combination of synthetic shuffled negative datasets and frequency-based tokenization, providing a versatile framework for genomic classification tasks.
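One way to picture a synthetic shuffled negative set is sketched below. This is a minimal illustration, not the authors' procedure: it assumes a negative is made by cutting a promoter sequence into chunks and permuting a random subset of them, so length and base composition are preserved while positional signals are disrupted. The function name `shuffled_negative` and the defaults `n_chunks=20` and `n_shuffle=12` are hypothetical choices made for this example.

```python
import random


def shuffled_negative(seq, n_chunks=20, n_shuffle=12, seed=None):
    """Return a chunk-shuffled copy of seq (hypothetical negative-set recipe).

    n_chunks and n_shuffle are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    # Ceiling division so that at most n_chunks chunks are produced.
    size = max(1, -(-len(seq) // n_chunks))
    chunks = [seq[i:i + size] for i in range(0, len(seq), size)]

    # Choose which chunk positions take part in the shuffle.
    picked = rng.sample(range(len(chunks)), min(n_shuffle, len(chunks)))
    permuted = picked[:]
    rng.shuffle(permuted)

    # Move each picked chunk to its permuted position; other chunks stay put.
    out = chunks[:]
    for src, dst in zip(picked, permuted):
        out[dst] = chunks[src]
    return "".join(out)


if __name__ == "__main__":
    positive = "TATAAAGGCGCGCCATTGACGTCAATGGGCTAGCTAGGATCCAAGCTTGG"
    print(shuffled_negative(positive, n_chunks=10, n_shuffle=6, seed=0))
```

Because only whole chunks move, the negative keeps the same nucleotide counts as its source promoter, which stops a classifier from separating the classes on base composition alone.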
Findings
Frequency-based tokenization (FBT) reduces training time without loss of accuracy
CNN outperforms other models in promoter classification
Frequency-based tokenization improves data processing efficiency
Abstract
Gene promoters are the key DNA regulatory elements positioned around transcription start sites and are responsible for regulating the gene transcription process. Various alignment-based, signal-based and content-based approaches have been reported for the prediction of promoters. However, since not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct eukaryotes, viz. yeast (Saccharomyces cerevisiae), plant (Arabidopsis thaliana) and human (Homo sapiens). We compared the one-hot vector encoding method with frequency-based tokenization (FBT) for data pre-processing on a 1-D Convolutional Neural Network (CNN) model. We found that…
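The two encodings compared in the abstract can be contrasted with a small sketch. It assumes that frequency-based tokenization means splitting each sequence into overlapping k-mers and giving each k-mer an integer id ranked by its frequency in the corpus. The abstract does not spell out the procedure, so the k-mer length `k=3`, the helper names, and the convention of reserving id 0 for unknown k-mers are all hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode each base as a 4-element indicator vector (unknown bases become all zeros)."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq.upper()]


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers; k=3 is an illustrative choice."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(seqs, k=3):
    """Map each k-mer to an integer id, most frequent first (id 0 is kept for unknowns)."""
    counts = Counter(km for s in seqs for km in kmers(s, k))
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: i + 1 for i, km in enumerate(ranked)}


def tokenize(seq, vocab, k=3):
    """Convert a sequence into a list of frequency-ranked k-mer ids."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]


if __name__ == "__main__":
    corpus = ["ACGTACGT", "ACGACG"]
    vocab = build_vocab(corpus)
    print(one_hot("ACGT"))          # four 4-element indicator vectors
    print(tokenize("ACGT", vocab))  # one integer id per overlapping 3-mer
```

Under these assumptions, a sequence of length n becomes an n × 4 matrix with one-hot encoding but a single list of n − k + 1 integers with tokenization, which is one plausible reason a tokenized input is cheaper to process.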
