Human Genome Book: Words, Sentences and Paragraphs
Wang Liang

TL;DR
This paper introduces a novel approach to interpreting the human genome as a book: it constructs words, sentences, and paragraphs by transfer learning from language models, enabling new tools for genomic analysis.
Contribution
It develops a foundational model that transfers linguistic capabilities from English to DNA, creating a vocabulary and segmentation methods for genomic sequences.
Findings
Successfully mapped DNA sequences to English vocabulary
Segmented the human genome into words, sentences, and paragraphs
Created an English version of the genomic book
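To make the idea of segmenting a genome into "words" concrete, here is a minimal sketch in Python. It is not the paper's method, which relies on transfer learning from language models. It swaps in a plain greedy longest-match segmentation over a small, hypothetical vocabulary (TOY_VOCAB), and the function name segment_dna is also an assumption made for illustration.

```python
def segment_dna(sequence, vocabulary):
    """Split a DNA string into 'words' by greedy longest-match.

    Illustrative only: `vocabulary` is a hypothetical set of DNA words,
    not the vocabulary built in the paper.
    """
    sequence = sequence.upper()
    # Longest word in the vocabulary bounds how far ahead we look.
    max_len = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(sequence):
        # Try the longest candidate first; a single base is the fallback,
        # so the loop always advances.
        for length in range(min(max_len, len(sequence) - i), 0, -1):
            candidate = sequence[i:i + length]
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy vocabulary, chosen only to exercise the function.
TOY_VOCAB = {"ATG", "GCC", "TATA", "AA", "TGA"}

if __name__ == "__main__":
    print(segment_dna("ATGGCCTATAAATGA", TOY_VOCAB))
```

Joining the returned words always reproduces the input sequence, because unmatched bases are emitted as single-letter words rather than dropped.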
Abstract
Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation editing and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge. Multilingual transfer ability, which assesses how well models fine-tuned on a source language can be applied to other languages, has been extensively studied in multilingual pre-trained models. Similarly, the transfer of natural language capabilities to "DNA language" has also been validated. Building upon these…
