A Novel Scalable Apache Spark Based Feature Extraction Approaches for Huge Protein Sequence and their Clustering Performance Analysis
Preeti Jha, Aruna Tiwari, Neha Bharill, Milind Ratnaparkhe, Om Prakash Patel, Nilagiri Harshith, Mukkamalla Mounika, Neha Nagendra

TL;DR
This paper introduces two scalable Apache Spark-based feature extraction methods for high-dimensional protein sequences, enabling efficient clustering with improved accuracy demonstrated through extensive soybean dataset experiments.
Contribution
The paper presents novel scalable feature extraction approaches, 60d-SPF and 6d-SCPSF, specifically designed for large protein datasets, improving clustering performance over existing methods.
Findings
60d-SPF outperforms 6d-SCPSF in clustering quality
Proposed methods achieve higher Silhouette index scores than existing feature extraction methods
Methods demonstrate scalability on large datasets
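The findings above are reported in terms of the Silhouette index. As a reminder of what that score measures, here is a minimal sketch in plain Python, assuming Euclidean distance and the standard definition (for each point, a is the mean distance to its own cluster, b is the lowest mean distance to any other cluster, and the score is (b - a) / max(a, b)). The function name `silhouette_index` and the toy points are illustrative assumptions, not taken from the paper, and this is not the authors' Spark implementation.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette index needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the listed members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        a = mean_dist(i, own)  # cohesion within the point's own cluster
        b = min(  # separation from the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D feature vectors (illustrative only).
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_index(toy_points, [0, 0, 1, 1])
bad_score = silhouette_index(toy_points, [0, 1, 0, 1])
print(good_score, bad_score)
```

Scores range from -1 to 1. The labelling that groups nearby points scores close to 1, while the labelling that mixes the two groups scores below 0, which is why a higher Silhouette index is read as better clustering quality.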
Abstract
Genome sequencing projects are rapidly increasing the number of high-dimensional protein sequence datasets. Clustering a high-dimensional protein sequence dataset using traditional machine learning approaches poses many challenges. Many different feature extraction methods exist and are widely used. However, extracting features from millions of protein sequences becomes impractical because current algorithms do not scale to that volume. Therefore, there is a need for an efficient feature extraction approach that extracts significant features. We propose two scalable feature extraction approaches for extracting features from huge protein sequences using Apache Spark, termed 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the…
Taxonomy
Topics: Machine Learning in Bioinformatics · Probiotics and Fermented Foods · Gene expression and cancer classification
