deepFEPS: Deep Learning-Oriented Feature Extraction for Biological Sequences
Hamid Ismail, Marwan Bikdash

TL;DR
deepFEPS is an open-source toolkit that unifies multiple modern feature extraction methods for biological sequences, simplifying preprocessing and enhancing reproducibility for machine learning applications in bioinformatics.
Contribution
It integrates diverse sequence embedding techniques into a single, user-friendly platform, streamlining the process from raw data to analysis-ready features.
Findings
Reduces preprocessing complexity for biological sequence analysis.
Provides automatic quality-control reports for sequence datasets.
Enables both novice and expert users to generate advanced embeddings.
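As a rough illustration of what an automatic quality-control report over a sequence dataset might summarize, here is a minimal sketch in plain Python. The function name `qc_summary`, the report fields, and the `ACGTN` alphabet are assumptions made for this example; the summary above does not specify which checks deepFEPS actually performs.

```python
from collections import Counter

# Assumption: a DNA alphabet that allows N for ambiguous bases.
DNA_ALPHABET = set("ACGTN")

def qc_summary(records, alphabet=DNA_ALPHABET):
    """Summarize a list of (header, sequence) pairs (hypothetical report fields)."""
    lengths = [len(seq) for _, seq in records]
    # Count symbols that fall outside the expected alphabet.
    unexpected = Counter()
    for _, seq in records:
        unexpected.update(ch for ch in seq.upper() if ch not in alphabet)
    # Count case-insensitive exact duplicates beyond the first occurrence.
    seen = Counter(seq.upper() for _, seq in records)
    return {
        "n_sequences": len(records),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "unexpected_symbols": dict(unexpected),
        "n_duplicates": sum(count - 1 for count in seen.values()),
    }

records = [("s1", "ACGTAC"), ("s2", "ACGXAC"), ("s3", "acgtac")]
report = qc_summary(records)
print(report)
```

In this toy dataset the report flags one unexpected symbol (`X`) and one duplicate sequence, the kind of issue worth catching before any embedding is trained.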
Abstract
Machine- and deep-learning approaches for biological sequences depend critically on transforming raw DNA, RNA, and protein FASTA files into informative numerical representations. However, this process is often fragmented across multiple libraries and preprocessing steps, which creates a barrier for researchers without extensive computational expertise. To address this gap, we developed deepFEPS, an open-source toolkit that unifies state-of-the-art feature extraction methods for sequence data within a single, reproducible workflow. deepFEPS integrates five families of modern feature extractors - k-mer embeddings (Word2Vec, FastText), document-level embeddings (Doc2Vec), transformer-based encoders (DNABERT, ProtBERT, and ESM2), autoencoder-derived latent features, and graph-based embeddings - into one consistent platform. The system accepts FASTA input via a web interface or command-line…
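The first step shared by the k-mer embedding family (Word2Vec, FastText) is turning each FASTA record into a list of overlapping k-mer "words". The following is a minimal sketch of that preprocessing step using only the Python standard library. The helper names `read_fasta` and `kmer_tokens`, and the defaults `k=3` and `stride=1`, are assumptions for illustration and are not taken from the deepFEPS code base.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines may be wrapped; normalize case and accumulate.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-mers (hypothetical defaults)."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

# Toy in-memory FASTA file with a wrapped record and a lowercase record.
example = StringIO(">seq1 toy DNA record\nACGTAC\nGT\n>seq2\nacgt\n")
corpus = {header: kmer_tokens(seq, k=3) for header, seq in read_fasta(example)}
print(corpus)
```

Each token list can then be treated as a sentence and passed to a word-embedding trainer; the per-sequence feature vector is commonly obtained by averaging the resulting k-mer vectors, although other pooling choices are possible.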
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Bioinformatics and Genomic Networks
