Segmenting DNA sequence into `words'

Wang Liang

arXiv:1202.2518 · q-bio.GN · March 13, 2015 · 2 cites

TL;DR

This paper introduces an unsupervised, n-gram based method for segmenting DNA sequences into words, identifying typical word lengths and providing a benchmark for evaluation.

Contribution

It presents a novel unsupervised segmentation approach for DNA sequences using statistical language models and establishes a benchmark for performance assessment.

Findings

01 Most DNA words are 12 to 15 base pairs long

02 The proposed method effectively segments DNA sequences

03 A benchmark for DNA segmentation methods is introduced

Abstract

This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA 'words' are 12 to 15 bp long. We then design an unsupervised, probability-based approach to segment DNA sequences. A benchmark for segmentation methods is also proposed.
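
To make the idea of probability-based segmentation concrete, here is a minimal sketch in Python. It is not the paper's algorithm, whose details the abstract does not give. It assumes a simple unigram model: candidate words are scored by their relative frequency among all substrings of a training sequence, and dynamic programming picks the split with the highest total log-probability. The function names `count_ngrams` and `segment`, the `floor` score for unseen candidates, and the toy sequences are all hypothetical. The toy uses a maximum word length of 6 for brevity, whereas the abstract reports typical lengths of 12 to 15 bp.

```python
import math
from collections import Counter


def count_ngrams(seq, max_len):
    """Count every substring of seq with length 1..max_len."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def segment(seq, counts, max_len):
    """Split seq into words that maximize the summed log-probability."""
    total = sum(counts.values())
    # Assumption: unseen candidates get a fixed low score instead of -inf.
    floor = math.log(0.5 / total)
    best = [0.0] + [-math.inf] * len(seq)  # best[i]: best score for seq[:i]
    back = [0] * (len(seq) + 1)            # back[i]: start of the last word
    for end in range(1, len(seq) + 1):
        for start in range(max(0, end - max_len), end):
            c = counts.get(seq[start:end], 0)
            score = best[start] + (math.log(c / total) if c else floor)
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers from the end to recover the words.
    words, end = [], len(seq)
    while end > 0:
        words.append(seq[back[end]:end])
        end = back[end]
    return words[::-1]


# Toy data (hypothetical), far smaller than a real genome.
training = "ATGCGATATGCGATTTATGCGA"
model = count_ngrams(training, max_len=6)
words = segment("ATGCGATATGCGA", model, max_len=6)
print(words)
```

The returned words always concatenate back to the input, and none is longer than `max_len`. Which split wins depends entirely on the counts, so a realistic experiment would need genome-scale training data and a larger `max_len`.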

Taxonomy

Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Algorithms and Data Compression