Natural language processing for clusterization of genes according to their functions

Vladislav Dordiuk; Ekaterina Demicheva; Fernando Polanco Espino; Konstantin Ushenin

arXiv:2207.08162·cs.CL·August 28, 2023

Open Access

TL;DR

This paper presents a novel NLP-based method for clustering thousands of genes by their functions, using language models and dimensionality reduction to improve gene analysis in mRNA-sequencing data.

Contribution

It introduces a pipeline that combines open database enrichment, pretrained language models, and clustering techniques for large-scale gene function analysis.

Findings

01. Identified the most effective pipeline among 180 tested configurations.

02. Demonstrated improved clustering performance with NLP-based encoding.

03. Validated results through expert review and clustering indexes.

Abstract

Hundreds of methods exist for analyzing data obtained from mRNA sequencing, but most of them focus on a small number of genes. In this study, we propose an approach that reduces the analysis of several thousand genes to the analysis of several clusters. The list of genes is enriched with information from open databases. The resulting descriptions are then encoded as vectors using a pretrained language model (BERT) and several text processing approaches. The encoded gene functions pass through dimensionality reduction and clusterization. To find the most efficient pipeline, we analyzed 180 pipeline configurations that use different methods at the major pipeline steps. Performance was evaluated with clusterization indexes and expert review of the results.
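
The steps named in the abstract (encode descriptions, reduce dimensionality, cluster, score with a clusterization index) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: TF-IDF is swapped in for BERT so the example runs without downloading a model, truncated SVD and k-means are just one possible choice for the reduction and clusterization steps, the silhouette score stands in for the unspecified clusterization indexes, and the gene names and descriptions are invented.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical gene names and descriptions, invented for illustration only.
descriptions = {
    "GENE_A": "protein kinase involved in cell cycle signaling",
    "GENE_B": "serine kinase that regulates cell cycle progression",
    "GENE_C": "kinase controlling mitotic cell division signaling",
    "GENE_D": "membrane transporter for glucose uptake",
    "GENE_E": "ion channel transporter in the plasma membrane",
    "GENE_F": "membrane transporter that exports amino acids",
}


def cluster_descriptions(texts, n_components=2, n_clusters=2, seed=0):
    """Encode texts, reduce their dimensionality, cluster them, and score the result."""
    # Step 1: encode each description as a vector (TF-IDF here; the paper uses BERT).
    vectors = TfidfVectorizer().fit_transform(texts)
    # Step 2: dimensionality reduction (truncated SVD is an assumed choice).
    reduced = TruncatedSVD(n_components=n_components, random_state=seed).fit_transform(vectors)
    # Step 3: clusterization (k-means is an assumed choice).
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)
    # Step 4: evaluate with a clusterization index (silhouette score, assumed).
    score = silhouette_score(reduced, labels)
    return labels, score


labels, score = cluster_descriptions(list(descriptions.values()))
for gene, label in zip(descriptions, labels):
    print(gene, label)
print("silhouette:", round(float(score), 3))
```

Comparing pipeline variants, as the study does across its 180 cases, would amount to calling a function like `cluster_descriptions` with different encoders, reduction methods, and clustering methods and comparing the resulting index values alongside expert review.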

Taxonomy

Topics: Topic Modeling · Genomics and Phylogenetic Studies · Natural Language Processing Techniques