Unsupervised Morphological Tree Tokenizer
Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu

TL;DR
This paper presents an unsupervised deep model that induces morphological structures within words to improve tokenization, outperforming traditional methods like BPE and WordPiece in segmentation and language modeling tasks.
Contribution
It introduces a novel deep model with MorphOverriding for unsupervised morphological structure induction, enhancing tokenization quality without annotated data.
Findings
Outperforms BPE and WordPiece in segmentation tasks
Effectively retains complete morphemes during tokenization
Improves language modeling performance
Abstract
As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named MorphOverriding to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner.…
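The top-down vocabulary matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the induced structure is a binary tree over characters, represented here as nested tuples, and that a node is emitted as a token as soon as its full character span appears in the vocabulary. The names `Tree`, `span_text`, and `top_down_tokenize`, the example tree, and the toy vocabularies are all hypothetical.

```python
from typing import List, Set, Tuple, Union

# Hypothetical tree encoding: a leaf is a single character, an internal
# node is a pair (left, right). The paper induces such structures with a
# trained model; here the tree is simply written by hand.
Tree = Union[str, Tuple["Tree", "Tree"]]


def span_text(node: Tree) -> str:
    """Return the characters covered by a node, left to right."""
    if isinstance(node, str):
        return node
    left, right = node
    return span_text(left) + span_text(right)


def top_down_tokenize(node: Tree, vocab: Set[str]) -> List[str]:
    """Walk the tree from the root and emit the first in-vocabulary span.

    Assumption: single characters are always emitted as tokens, even when
    they are missing from the vocabulary, so the recursion terminates.
    """
    text = span_text(node)
    if text in vocab or isinstance(node, str):
        return [text]
    left, right = node
    return top_down_tokenize(left, vocab) + top_down_tokenize(right, vocab)


# Hand-written tree for "unkind": ((u n) (((k i) n) d))
tree: Tree = (("u", "n"), ((("k", "i"), "n"), "d"))

print(top_down_tokenize(tree, {"un", "kind"}))
print(top_down_tokenize(tree, {"un", "ki"}))
```

With the first vocabulary, the walk stops at the two subtrees covering "un" and "kind", so the morpheme boundary is preserved. With the second, "kind" is not in the vocabulary, so the walk descends further and falls back to smaller spans and single characters.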
Taxonomy
Topics: Neural Networks and Applications
Methods: ALIGN · Byte Pair Encoding · WordPiece
