Generative Language Models on Nucleotide Sequences of Human Genes
Musa Nuri Ihtiyar, Arzucan Ozgur

TL;DR
This study explores the development and evaluation of autoregressive generative language models for human gene nucleotide sequences, revealing that RNNs perform best and that minimal vocabulary size does not significantly reduce data needs.
Contribution
To the authors' knowledge, this is the first systematic exploration of generative language models on human gene sequences, comparing several model families and analyzing their data requirements.
Findings
RNNs outperform other models in generating DNA sequences
Simple N-gram models show promising results
Minimal vocabulary size does not significantly reduce data requirements
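To make the N-gram finding concrete, here is a minimal sketch of a character-level N-gram language model over the four nucleotides, with add-alpha smoothing, perplexity evaluation, and autoregressive sampling. It is an illustration only, not the authors' implementation: the function names, the "^" padding symbol, the smoothing scheme, and the default order n=3 are all assumptions made for this example.

```python
import math
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # the four nucleotides
PAD = "^"          # hypothetical start-of-sequence padding symbol


def train_ngram(sequences, n=3):
    """Count how often each nucleotide follows each (n-1)-length context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = PAD * (n - 1) + seq
        for i in range(n - 1, len(padded)):
            counts[padded[i - n + 1:i]][padded[i]] += 1
    return counts


def prob(counts, context, symbol, alpha=1.0):
    """Add-alpha smoothed estimate of P(symbol | context)."""
    ctx = counts.get(context, {})
    total = sum(ctx.values())
    return (ctx.get(symbol, 0) + alpha) / (total + alpha * len(ALPHABET))


def perplexity(counts, seq, n=3):
    """Per-nucleotide perplexity of seq under the model (lower is better)."""
    padded = PAD * (n - 1) + seq
    log_sum = 0.0
    for i in range(n - 1, len(padded)):
        log_sum += math.log(prob(counts, padded[i - n + 1:i], padded[i]))
    return math.exp(-log_sum / len(seq))


def generate(counts, length, n=3, seed=0):
    """Sample a sequence one nucleotide at a time, conditioning on the last n-1."""
    rng = random.Random(seed)
    out = PAD * (n - 1)
    for _ in range(length):
        context = out[len(out) - (n - 1):]
        weights = [prob(counts, context, s) for s in ALPHABET]
        out += rng.choices(ALPHABET, weights=weights)[0]
    return out[n - 1:]


if __name__ == "__main__":
    model = train_ngram(["ACGTACGTACGT"], n=3)  # toy data, not real gene sequences
    print(perplexity(model, "ACGTACGT"))
    print(generate(model, 12))
```

An untrained model assigns probability 1/4 to every nucleotide, so its perplexity is exactly 4. Any model that has learned structure from the training sequences should score below that on similar data, which makes 4 a natural baseline for a four-letter vocabulary.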
Abstract
Language models, especially transformer-based ones, have achieved colossal success in NLP. To be precise, studies like BERT for NLU and works like GPT-3 for NLG are very important. If we consider DNA sequences as a text written with an alphabet of four letters representing the nucleotides, they are similar in structure to natural languages. This similarity has led to the development of discriminative language models such as DNABert in the field of DNA-related bioinformatics. To our knowledge, however, the generative side of the coin is still largely unexplored. Therefore, we have focused on the development of an autoregressive generative language model such as GPT-3 for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we decided to conduct our study on a smaller scale and focus on nucleotide sequences of human genes rather…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Softmax · Dense Connections · Dropout
