Human Genome Book: Words, Sentences and Paragraphs

Wang Liang

arXiv:2501.16982 · q-bio.OT · January 29, 2025

TL;DR

This paper introduces an approach to interpreting the human genome as a book: it constructs genomic equivalents of words, sentences, and paragraphs using transfer learning from language models, with the aim of enabling new tools for genomic analysis.

Contribution

The paper develops a foundational model that transfers linguistic capabilities learned from English to DNA, and uses it to build a vocabulary and segmentation methods for genomic sequences.

Findings

01. Successfully mapped DNA sequences to English vocabulary

02. Segmented the human genome into words, sentences, and paragraphs

03. Created an English version of the genomic book
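
As a rough illustration of what segmenting a DNA sequence into "words" can mean, the sketch below applies greedy longest-match segmentation against a small hand-made vocabulary. The vocabulary, the function name `segment_dna`, and the example sequence are all hypothetical and chosen only for illustration. This is a swapped-in, dictionary-based technique, not the paper's method, which relies on a language model and is not reproduced here.

```python
def segment_dna(sequence, vocabulary):
    """Greedy longest-match segmentation of a DNA string.

    Falls back to a single base when no vocabulary entry matches,
    so every input character ends up in exactly one output word.
    """
    max_len = max(len(word) for word in vocabulary)
    words = []
    i = 0
    while i < len(sequence):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(sequence) - i), 0, -1):
            candidate = sequence[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary, for illustration only (not from the paper).
TOY_VOCABULARY = {"ATG", "GCC", "TATA", "AA"}
words = segment_dna("ATGGCCTATAAAC", TOY_VOCABULARY)
print(words)
```

Because unmatched bases are emitted one at a time, joining the returned words always reproduces the input sequence, which makes the segmentation easy to sanity-check.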

Abstract

Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation editing and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge. Multilingual transfer ability, which assesses how well models fine-tuned on a source language can be applied to other languages, has been extensively studied in multilingual pre-trained models. Similarly, the transfer of natural language capabilities to "DNA language" has also been validated. Building upon these…

Code & Models

Repositories

maris205/genome_book (Official)

Taxonomy

Topics: Diverse Musicological Studies