Detecting "protein words" through unsupervised word segmentation

Wang Liang; Zhao KaiYong

arXiv:1404.6866·cs.CE·October 29, 2015

TL;DR

This paper applies unsupervised word segmentation to protein sequences. It reports that the resulting 'protein words' express information more efficiently than secondary structure segments, and suggests that noncoding regions may contain 'protein ruins'.

Contribution

It introduces unsupervised word segmentation as a tool for analyzing protein sequences, compares the resulting 'protein words' with secondary structure segments, and reports experimental hints of 'protein ruins' in noncoding regions.

Findings

01. Unsupervised word segmentation is more efficient than secondary structure segmentation in expressing information.

02. The experiments suggest there may be 'protein ruins' in noncoding regions.

03. The method provides a new perspective for protein sequence analysis.

Abstract

Unsupervised word segmentation methods were applied to protein sequences. Raw sequences, such as 'MTMDKSELVQKA...', were used as input, and segmented 'protein word' sequences, such as 'MTM DKSE LVQKA', were obtained as output. We compare the 'protein words' produced by unsupervised segmentation with the segments given by protein secondary structure. An interesting finding is that unsupervised word segmentation is more efficient than secondary structure segmentation in expressing information. Our experiment also suggests there may be some 'protein ruins' in regions currently regarded as noncoding.
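The input/output relationship described in the abstract can be illustrated with a minimal sketch. The abstract does not say which segmentation algorithm the authors use, so the code below swaps in a generic technique: Viterbi decoding under a unigram word model. The function name `segment`, the parameters `max_len` and `unknown_prob`, and the toy probabilities are all hypothetical and chosen only to reproduce the example from the abstract; in a real unsupervised setting the word probabilities would be estimated from unlabeled sequences rather than written by hand.

```python
import math


def segment(sequence, word_probs, max_len=6, unknown_prob=1e-6):
    """Split `sequence` into the most probable words under a unigram model.

    Illustrative only: `word_probs` is a hand-made toy lexicon here, not
    something learned from data, and this is not the paper's own method.
    """
    n = len(sequence)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of sequence[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i]: start index of the last word ending at i

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = sequence[start:end]
            if piece in word_probs:
                prob = word_probs[piece]
            elif len(piece) == 1:
                # Fallback so that every sequence can be segmented.
                prob = unknown_prob
            else:
                continue
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sequence[start:end])
        end = start
    return words[::-1]


# Hypothetical probabilities, assumed purely for illustration.
toy_probs = {"MTM": 0.2, "DKSE": 0.2, "LVQKA": 0.2}
print(" ".join(segment("MTMDKSELVQKA", toy_probs)))  # prints "MTM DKSE LVQKA"
```

Residues that match no lexicon entry fall back to single-letter words with a very small probability, so the decoder prefers known words whenever they cover the sequence.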

Code & Models

Repositories

maris205/secondary_structure_detection (official)

Taxonomy

Topics: Machine Learning in Bioinformatics · Natural Language Processing Techniques · Topic Modeling