DNAHLM -- DNA sequence and Human Language mixed large language Model
Wang Liang

TL;DR
This paper presents a novel large language model that integrates DNA sequences and human language, enabling advanced applications like prompt engineering and zero-shot prediction in DNA analysis.
Contribution
It introduces a unified model trained on both DNA and English text, facilitating multi-task learning and innovative applications for DNA sequence analysis.
Findings
Effective zero-shot DNA prediction demonstrated
Multi-task capabilities achieved with instruction fine-tuning
Unified DNA-human language model shows promising potential
Abstract
Many DNA large language models already exist, but most are still used in traditional ways, such as extracting sequence features for classification tasks. More innovative applications of large language models, such as prompt engineering, RAG, and zero-shot or few-shot prediction, remain challenging for DNA-based models. The key issue is that DNA models and natural language models are entirely separate, while techniques like prompt engineering depend on natural language; this separation significantly limits the application of DNA large language models. This paper introduces a model pre-trained with the GPT-2 network on a combination of DNA sequences and English text, using a unified BPE tokenization method. We then convert classification and other downstream tasks into Alpaca-format instruction data and perform instruction fine-tuning on this pre-trained…
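The conversion of a classification task into Alpaca-format instruction data, as described in the abstract, might look roughly like the sketch below. The instruction wording, the toy sequences and labels, and the helper names `to_alpaca_record` and `render_prompt` are assumptions made for illustration and are not taken from the paper. The prompt template is the one commonly used with Alpaca-style datasets, and the paper may use a different one.

```python
import json

# Commonly used Alpaca prompt template for records that carry an "input" field.
# Assumption: the paper's exact template is not given in this abstract.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def to_alpaca_record(sequence: str, label: str, instruction: str) -> dict:
    """Wrap one labelled DNA sequence as an Alpaca-style instruction record."""
    return {"instruction": instruction, "input": sequence, "output": label}


def render_prompt(record: dict) -> str:
    """Build the text prompt for a record; the model is trained to emit record['output'] after it."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )


# Hypothetical task wording and toy examples (not real benchmark data).
INSTRUCTION = "Determine whether the following DNA sequence is a promoter."
examples = [
    ("TATAAAAGGCGCGCCATTG", "promoter"),
    ("ACGTACGTACGTACGTACG", "non-promoter"),
]

records = [to_alpaca_record(seq, lab, INSTRUCTION) for seq, lab in examples]

if __name__ == "__main__":
    # Instruction datasets in this format are usually stored as a JSON list.
    print(json.dumps(records, indent=2))
    print(render_prompt(records[0]) + records[0]["output"])
```

Framing each task as a natural-language instruction plus a DNA input is what lets a single model serve several downstream tasks: only the instruction text changes between tasks, while the sequence goes through the same unified tokenizer as the English text.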
Taxonomy
Topics: Computational and Text Analysis Methods
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · BART · WordPiece · BERT · Dropout · Byte Pair Encoding · RAG · Dense Connections
