Deciphering genomic codes using advanced NLP techniques: a scoping review
Shuyan Cheng, Yishu Wei, Yiliang Zhou, Zihan Xu, Drew N Wright, Jinze Liu, Yifan Peng

TL;DR
This review explores how advanced NLP techniques, especially transformer-based models, are increasingly used to analyze complex genomic data, improving annotation prediction and understanding of genomic structures.
Contribution
It provides a comprehensive overview of recent NLP applications in genomics, highlighting the potential and current limitations of these methods in genomic code deciphering.
Findings
Tokenization and transformers improve genomic data processing
NLP models predict regulatory annotations like transcription-factor binding sites
NLP applications in genomics are promising but face challenges in transparency
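To make the tokenization finding concrete, here is a minimal sketch of overlapping k-mer tokenization, one common way to turn a DNA sequence into "words" for a language model. The summary above does not say which tokenization schemes the review covers, so the choice of k-mers, the function names `kmer_tokenize` and `build_vocab`, and the defaults `k=3` and `stride=1` are all illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: k-mer tokenization is assumed here as an example
# scheme; the names and defaults below are hypothetical, not from the review.

def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into k-mers taken every `stride` bases."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens (empty range).
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab: dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


tokens = kmer_tokenize("ATGCGTA", k=3)
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]  # integer ids a model could embed
print(tokens)
print(token_ids)
```

With `stride=1` the k-mers overlap, so neighbouring tokens share `k - 1` bases; setting `stride=k` gives non-overlapping tokens and a shorter input at the cost of sensitivity to the reading frame.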
Abstract
Objectives: The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of Natural Language Processing (NLP) techniques, particularly Large Language Models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data. Methods: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP…
