Consensus Sequence Segmentation
Tamal Chowdhury, Rabindra Rakshit, Arko Banerjee

TL;DR
This paper presents a linear-time unsupervised algorithm for segmenting sequences into words or phrases based solely on statistical relationships, outperforming previous methods on various benchmarks.
Contribution
The paper introduces an efficient unsupervised segmentation algorithm that requires no prior knowledge of the lexicon.
Findings
Superior segmentation accuracy over previous methods
Operates in linear time, suitable for large sequences
Effective without prior lexicon or supervised data
Abstract
In this paper we introduce a method to detect words or phrases in a given sequence of symbols without knowing the lexicon. Our linear-time unsupervised algorithm relies entirely on statistical relationships among the symbols in the input sequence to detect the locations of word boundaries. We compare our algorithm to previous approaches from the unsupervised sequence segmentation literature and obtain superior segmentation on a number of benchmarks.
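The abstract does not spell out the algorithm, so the following is only a generic illustration of statistics-driven boundary detection, not the authors' method. It is a minimal sketch assuming a swapped-in, well-known heuristic: place a boundary wherever the transition probability from one symbol to the next falls below a cutoff. The function name `segment_by_transition_probability` and the `threshold` parameter are hypothetical, chosen for this example.

```python
from collections import Counter


def segment_by_transition_probability(sequence: str, threshold: float = 0.6) -> list[str]:
    """Split `sequence` wherever P(next symbol | current symbol) < threshold.

    Illustrative heuristic only (an assumption for this sketch); it is not
    the algorithm proposed in the paper.
    """
    if not sequence:
        return []

    # First pass: count adjacent symbol pairs and how often each symbol
    # appears as the first element of a pair.
    pair_counts = Counter(zip(sequence, sequence[1:]))
    first_counts = Counter(sequence[:-1])

    # Second pass: cut after position i when the transition into the next
    # symbol is statistically weak.
    segments = []
    start = 0
    for i in range(len(sequence) - 1):
        current, following = sequence[i], sequence[i + 1]
        probability = pair_counts[(current, following)] / first_counts[current]
        if probability < threshold:
            segments.append(sequence[start:i + 1])
            start = i + 1
    segments.append(sequence[start:])
    return segments


if __name__ == "__main__":
    print(segment_by_transition_probability("abcabcxyzabc"))
```

On the toy input `abcabcxyzabc` this prints `['abc', 'abc', 'xyzabc']`: the symbol `c` is followed by `a` once and by `x` once, so both transitions have probability 0.5 and fall below the 0.6 cutoff, while every other transition has probability 1. Both passes visit each symbol once, so the sketch runs in time linear in the sequence length, matching the complexity class the paper claims for its own method.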
Taxonomy
Topics: Algorithms and Data Compression · Fractal and DNA sequence analysis · Genomics and Phylogenetic Studies
