Detecting "protein words" through unsupervised word segmentation
Wang Liang, Zhao KaiYong

TL;DR
This paper applies unsupervised word segmentation to protein sequences. The resulting 'protein words' express information more efficiently than secondary structure segments, and the experiments suggest that noncoding regions may contain 'protein ruins', offering a new approach to protein sequence analysis.
Contribution
It introduces unsupervised word segmentation as a tool for analyzing protein sequences, shows that this segmentation expresses information more efficiently than secondary structure segmentation, and points to possible 'protein ruins' in noncoding regions.
Findings
Unsupervised word segmentation expresses information more efficiently than secondary structure segmentation.
The experiments suggest there may be 'protein ruins' in current noncoding regions.
The method offers a new perspective for protein sequence analysis.
Abstract
Unsupervised word segmentation methods were applied to analyze protein sequences. Protein sequences, such as 'MTMDKSELVQKA...', were used as input to these methods, and segmented 'protein word' sequences, such as 'MTM DKSE LVQKA', were obtained as output. We compare the 'protein words' produced by unsupervised segmentation with the segments produced by protein secondary structure segmentation. An interesting finding is that unsupervised word segmentation is more efficient than secondary structure segmentation in expressing information. Our experiment also suggests that there may be some 'protein ruins' in current noncoding regions.
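To make the input and output formats in the abstract concrete, here is a minimal Python sketch of the decoding step of one common family of unsupervised segmenters: a unigram word model decoded with Viterbi search. This is an illustration only, not the authors' method, which the abstract does not specify. The function name `segment`, the parameters `max_len` and `unk_logprob`, and the toy vocabulary with its probabilities are all hypothetical; a real unsupervised method would learn the vocabulary from a large corpus of unlabeled sequences.

```python
import math


def segment(sequence, word_logprob, max_len=5, unk_logprob=-20.0):
    """Split `sequence` into the word sequence with the highest total log-probability."""
    n = len(sequence)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for sequence[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            if word in word_logprob:
                score = word_logprob[word]
            elif len(word) == 1:
                # Fallback so any sequence can be segmented residue by residue.
                score = unk_logprob
            else:
                continue  # unknown multi-residue strings are not allowed as words
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end of the sequence to recover the words.
    words, end = [], n
    while end > 0:
        words.append(sequence[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical toy vocabulary; the probabilities are made up for illustration.
toy_vocab = {
    "MTM": math.log(0.2),
    "DKSE": math.log(0.2),
    "LVQKA": math.log(0.2),
}

print(" ".join(segment("MTMDKSELVQKA", toy_vocab)))  # prints: MTM DKSE LVQKA
```

With this toy vocabulary the sketch reproduces the segmentation quoted in the abstract, but only because the three words were placed in the vocabulary by hand. Sequences with no known words fall back to single residues.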
Taxonomy
Topics: Machine Learning in Bioinformatics · Natural Language Processing Techniques · Topic Modeling
