Amino acid frequency and domain features serve well for random forest based classification of thermophilic and mesophilic protein; a case study on serine proteases
Jithin S. Sunny, Lilly M. Saleena

TL;DR
This study developed a highly accurate, enzyme-specific random forest classifier for thermophilic versus mesophilic serine proteases using minimal features, with potential applications in protein engineering and a web tool for practical use.
Contribution
It introduces a novel, simplified feature-based machine learning model for enzyme classification that achieves high accuracy with fewer features than previous methods.
Findings
Achieved 95.71% classification accuracy.
Used significantly fewer features than prior models.
Developed a web platform for real-time enzyme classification.
Abstract
Thermostability is an important prerequisite for enzymes employed in industrial applications. Several machine learning-based models have thus been formulated for protein classification based on this particular trait. These models have employed features derived from sequences, structures, or both, achieving >93% accuracy under 10-fold cross-validation. Besides using various proteins from a wide range of organisms, such studies also rely on hundreds of features. In the present study, an enzyme-specific classification model was created using significantly fewer features while providing similar classification accuracy for thermophilic and non-thermophilic serine proteases. For building the classifier, 219 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted for 800 sequences followed…
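As an illustration of the amino acid frequency features named in the title, the following minimal Python sketch computes the fraction of each of the 20 standard residues in a sequence. The function name, the residue ordering, and the toy sequence are assumptions made for this example; the paper's actual feature extraction pipeline and its domain features are not described in the text above.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes); the ordering is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Characters outside the 20 standard residues (e.g. 'X' or gaps) are
    ignored, so the returned fractions sum to 1 over the counted residues.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the study's dataset.
    freqs = amino_acid_frequencies("AAGK")
    for aa, value in freqs.items():
        if value > 0:
            print(aa, value)
```

Each sequence then maps to a fixed-length vector of 20 values, which could be passed, together with a thermophilic or mesophilic label, to any random forest implementation (for example, scikit-learn's `RandomForestClassifier`).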
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Enzyme Production and Characterization
