Consensus Sequence Segmentation

Tamal Chowdhury; Rabindra Rakshit; Arko Banerjee

arXiv:1308.3839 · cs.CL · December 31, 2013

TL;DR

This paper presents a linear-time unsupervised algorithm for segmenting sequences into words or phrases based solely on statistical relationships, outperforming previous methods on various benchmarks.

Contribution

The paper introduces a linear-time unsupervised segmentation algorithm that requires no prior lexicon and relies only on statistical relationships among the symbols of the input sequence.

Findings

1. Superior segmentation accuracy over previous methods
2. Operates in linear time, suitable for large sequences
3. Effective without prior lexicon or supervised data

Abstract

In this paper we introduce a method to detect words or phrases in a given sequence of letters without knowing the lexicon. Our linear-time unsupervised algorithm relies entirely on statistical relationships among the letters in the input sequence to detect the locations of word boundaries. We compare our algorithm to previous approaches from the unsupervised sequence segmentation literature and obtain superior segmentation on a number of benchmarks.
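To make the idea of locating boundaries from statistics alone concrete, here is a minimal sketch in Python. It is not the algorithm from the paper, whose details are not given on this page. It assumes a much simpler, classic heuristic: cut wherever the forward transition probability between adjacent symbols dips to a local minimum. The function name `segment` and the cut rule are illustrative assumptions.

```python
from collections import Counter


def segment(seq):
    """Split seq at local minima of the forward transition probability.

    Illustrative heuristic only (an assumption, not the paper's method):
    the gap between seq[i] and seq[i+1] is scored by P(seq[i+1] | seq[i]),
    estimated from counts within seq itself.
    """
    if len(seq) < 3:
        return [seq] if seq else []

    # Count each symbol that has a successor, and each adjacent pair.
    unigrams = Counter(seq[:-1])
    bigrams = Counter(zip(seq, seq[1:]))

    # tp[i] scores the gap between positions i and i + 1.
    tp = [bigrams[(a, b)] / unigrams[a] for a, b in zip(seq, seq[1:])]

    words, start = [], 0
    for i in range(1, len(tp) - 1):
        # A strict local minimum is treated as a boundary.
        if tp[i] < tp[i - 1] and tp[i] < tp[i + 1]:
            words.append(seq[start:i + 1])
            start = i + 1
    words.append(seq[start:])
    return words


if __name__ == "__main__":
    text = "thedogsawthecatandthecatsawthedog"
    print(segment(text))
```

The sketch makes one pass to collect counts and one pass to place cuts, so it runs in time linear in the sequence length and needs no lexicon. Whatever it outputs, concatenating the segments always reproduces the input.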

Taxonomy

Topics: Algorithms and Data Compression · Fractal and DNA sequence analysis · Genomics and Phylogenetic Studies