Comparison of machine learning and deep learning techniques in promoter prediction across diverse species

Nikita Bhandari; Satyajeet Khare; Rahee Walambe; Ketan Kotecha

arXiv:2105.07659·q-bio.GN·May 18, 2021

TL;DR

This study compares machine learning and deep learning methods for promoter prediction across yeast, plant, and human genomes, highlighting the effectiveness of CNN and frequency-based tokenization for efficient classification.

Contribution

It introduces a novel combination of synthetic shuffled negative datasets and frequency-based tokenization, providing a versatile framework for genomic classification tasks.
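
The idea of a synthetic shuffled negative set can be illustrated with a minimal sketch. It assumes that each negative example is made by randomly permuting the nucleotides of a promoter sequence, which preserves length and base composition while destroying positional motifs. The summary above does not state the exact shuffling scheme, so the function names `shuffled_negative` and `build_dataset` and the per-nucleotide shuffle are illustrative assumptions, not the authors' procedure.

```python
import random


def shuffled_negative(seq: str, rng: random.Random) -> str:
    """Return a copy of seq with its nucleotides randomly permuted.

    Assumption: a plain per-nucleotide shuffle; the original study may
    shuffle in a different way (for example, in blocks).
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def build_dataset(promoters, seed=0):
    """Pair each promoter (label 1) with one shuffled negative (label 0)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    positives = [(seq, 1) for seq in promoters]
    negatives = [(shuffled_negative(seq, rng), 0) for seq in promoters]
    return positives + negatives


# Hypothetical toy sequences, for illustration only.
dataset = build_dataset(["ACGTACGTTT", "GGGCCCAATT"])
```

Because each negative is derived from a positive, the two classes are balanced and share the same base composition, so a classifier has to rely on sequence order rather than nucleotide frequencies.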

Findings

01. Frequency-based tokenization (FBT) reduces training time without loss of accuracy

02. CNN outperforms other models in promoter classification

03. Frequency-based tokenization improves data processing efficiency

Abstract

Gene promoters are the key DNA regulatory elements positioned around the transcription start sites and are responsible for regulating the gene transcription process. Various alignment-based, signal-based and content-based approaches have been reported for the prediction of promoters. However, since not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct eukaryotes, viz. yeast (Saccharomyces cerevisiae), plant (Arabidopsis thaliana) and human (Homo sapiens). We compared the one-hot vector encoding method with frequency-based tokenization (FBT) for data pre-processing on a 1-D Convolutional Neural Network (CNN) model. We found that…
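
The two encodings compared in the abstract can be contrasted with a small sketch. It assumes that FBT splits each sequence into overlapping k-mers and assigns every k-mer an integer index ranked by how often it occurs in the training set, with index 0 reserved for unseen k-mers. The abstract does not give these details, so the k-mer size, the helper names (`one_hot`, `kmers`, `fit_vocab`, `tokenize`) and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode each nucleotide as a 4-element 0/1 vector (A, C, G, T)."""
    return [[1 if base == b else 0 for b in ALPHABET] for base in seq]


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fit_vocab(seqs, k):
    """Map each k-mer to an integer rank: 1 = most frequent.

    Ties are broken alphabetically so the mapping is deterministic
    (an assumption of this sketch). Index 0 is reserved for k-mers
    that were not seen when fitting.
    """
    counts = Counter(km for s in seqs for km in kmers(s, k))
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: i + 1 for i, km in enumerate(ranked)}


def tokenize(seq, vocab, k):
    """Encode a sequence as a list of k-mer ranks."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]


# Hypothetical toy sequences, for illustration only.
train = ["ACGTAC", "ACGGAC"]
vocab = fit_vocab(train, k=3)
tokens = tokenize("ACGTAC", vocab, k=3)   # one integer per k-mer
matrix = one_hot("ACGTAC")                # one 4-vector per nucleotide
```

In this sketch the one-hot form stores four values per nucleotide, while the tokenized form stores a single integer per k-mer, which is one plausible reason a tokenized input could be cheaper to process.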
