NLP-based classification of software tools for metagenomics sequencing data analysis into EDAM semantic annotation
Kaoutar Daoud Hiri, Matjaž Hren, Tomaž Curk

TL;DR
This paper presents a machine learning-based system to classify metagenomics software tools into semantic categories, aiding researchers in selecting appropriate tools for data analysis pipelines.
Contribution
It introduces a novel classification approach using NLP and machine learning to categorize metagenomics tools into EDAM semantic annotations, improving pipeline construction.
Findings
Achieved an area under the precision-recall curve (AUPRC) of 0.85 with Logistic Regression and BioBERT.
Used 224 curated tools with text from abstracts and methods.
Identified optimal features and models for accurate classification.
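As a rough illustration of the AUPRC metric reported above, the sketch below computes the step-wise area under the precision-recall curve (often called average precision) for a single class from binary labels and classifier scores. The function name `average_precision` and the toy labels and scores are hypothetical and not taken from the paper. The summary does not state which AUPRC estimator the authors used or how scores were aggregated across the 13 classes, so this should be read as one common definition rather than the authors' exact procedure.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve for one class.

    labels: iterable of 0/1 ground-truth values (1 = tool belongs to the class)
    scores: iterable of classifier scores, higher = more confident
    Note: tied scores are ranked in input order; no special tie handling.
    """
    labels = list(labels)
    scores = list(scores)
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank examples from most to least confident (stable sort).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision at the rank where this positive is retrieved.
            total += hits / rank
    return total / n_pos


# Hypothetical toy data: four tools, two of which belong to the class.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(toy_labels, toy_scores), 4))  # 0.8333
```

In the toy example the two positives are retrieved at ranks 1 and 3, giving precisions of 1/1 and 2/3, whose mean is 5/6. For a multi-class setting, one option is to compute this per class in a one-vs-rest fashion and average the results, though that aggregation is an assumption here.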
Abstract
Motivation: The rapid growth of metagenomics sequencing data makes metagenomics increasingly dependent on computational and statistical methods for fast and efficient analysis. Consequently, novel analysis tools for big-data metagenomics are constantly emerging. One of the biggest challenges for researchers occurs in the analysis planning stage: selecting the most suitable metagenomics software tool to gain valuable insights from sequencing data. The building process of data analysis pipelines is often laborious and time-consuming since it requires a deep and critical understanding of how to apply a particular tool to complete a specified metagenomics task.
Results: We have addressed this challenge by using machine learning methods to develop a classification system of metagenomics software tools into 13 classes (11 semantic annotations of EDAM and two virus-specific classes) based on…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Gene expression and cancer classification · Machine Learning in Bioinformatics
Methods: GloVe Embeddings · Logistic Regression
