Unsupervised Tokenization Learning

Anton Kolonin; Vignav Ramesh

arXiv:2205.11443·cs.CL·December 16, 2022

Open Access

TL;DR

This study introduces an unsupervised tokenization method based on the "transition freedom" metric, which outperforms mutual information and conditional probability across multilingual corpora. The best-performing variant of the metric differs by language, and a larger training corpus does not necessarily improve tokenization quality.

Contribution

The paper proposes an unsupervised tokenization approach based on "transition freedom", shows that different languages call for different variants of the metric, and reports quality better than or comparable to lexicon-based tokenization, depending on the language.

Findings

01

The transition freedom metric outperforms mutual information and conditional probability for unsupervised tokenization, with F-measure scores from 0.71 to 1.0.

02

Different languages require different offshoots of the metric (its derivative, variance, or "peak values") for successful tokenization.

03

Larger training corpora do not necessarily improve tokenization quality, while compressing the models by eliminating statistically weak evidence tends to improve performance.

Abstract

In the presented study, we discover that the so-called "transition freedom" metric appears superior for unsupervised tokenization purposes in comparison to statistical metrics such as mutual information and conditional probability, providing F-measure scores in the range from 0.71 to 1.0 across the explored multilingual corpora. We find that different languages require different offshoots of that metric (such as derivative, variance, and "peak values") for successful tokenization. Larger training corpora do not necessarily result in better tokenization quality, while compressing the models by eliminating statistically weak evidence tends to improve performance. The proposed unsupervised tokenization technique provides quality better than or comparable to lexicon-based ones, depending on the language.
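The abstract does not define transition freedom, so the following is only a minimal sketch under stated assumptions. It assumes that the transition freedom of a character n-gram is the number of distinct characters observed immediately after it in the training corpus, that "eliminating statistically weak evidence" can be approximated by dropping transitions seen fewer than a threshold number of times, and that a token boundary may be placed after a strict local peak in the freedom profile. All function names, the `min_count` parameter, and the peak rule are hypothetical and are not taken from the authors' implementation.

```python
from collections import defaultdict


def build_transitions(corpus, n=1):
    """Count how often each character follows each character n-gram."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in corpus:
        for i in range(len(line) - n):
            counts[line[i:i + n]][line[i + n]] += 1
    return counts


def prune(counts, min_count=1):
    """Assumed form of model compression: drop rarely seen transitions."""
    return {
        gram: {ch: c for ch, c in following.items() if c >= min_count}
        for gram, following in counts.items()
    }


def transition_freedom(counts, gram):
    """Assumed definition: number of distinct characters seen after gram."""
    return len(counts.get(gram, {}))


def split_on_peaks(text, counts, n=1):
    """Cut after every position whose freedom is a strict local peak."""
    profile = [
        transition_freedom(counts, text[i - n + 1:i + 1]) if i >= n - 1 else 0
        for i in range(len(text))
    ]
    tokens, start = [], 0
    for i in range(1, len(text) - 1):
        if profile[i] > profile[i - 1] and profile[i] > profile[i + 1]:
            tokens.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        tokens.append(text[start:])
    return tokens


if __name__ == "__main__":
    # Toy corpus: "ab" is always followed by a different character.
    counts = build_transitions(["abx", "aby", "abz"], n=1)
    print(transition_freedom(counts, "a"))  # 1
    print(transition_freedom(counts, "b"))  # 3
    print(split_on_peaks("abx", counts))    # ['ab', 'x']
```

In this toy corpus the character "b" can be followed by three different characters while "a" is always followed by "b", so the freedom profile peaks at the end of "ab" and the sketch cuts there. The derivative and variance offshoots mentioned in the abstract are not covered by this sketch.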

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling