Unsupervised Morphological Tree Tokenizer

Qingyang Zhu; Xiang Hu; Pengyu Ji; Wei Wu; Kewei Tu

arXiv:2406.15245 · cs.CL · July 11, 2025

TL;DR

This paper presents an unsupervised deep model that induces morphological structures within words to improve tokenization, outperforming traditional methods like BPE and WordPiece in segmentation and language modeling tasks.

Contribution

The paper introduces a deep model with a mechanism called MorphOverriding for unsupervised induction of morphological structure, improving tokenization quality without annotated data.

Findings

01 Outperforms BPE and WordPiece in segmentation tasks

02 Effectively retains complete morphemes during tokenization

03 Improves language modeling performance

Abstract

As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named MorphOverriding to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner.…
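
The top-down vocabulary matching mentioned at the end of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation, and since the abstract is truncated here the actual procedure may differ. It assumes the induced structure is a binary tree over the characters of a word, represented as nested tuples, and that a node becomes a token as soon as its character span is found in the vocabulary. The names `Tree`, `span_text`, and `tokenize_top_down`, the hand-written tree for "unkind", and the fallback of emitting single characters at unmatched leaves are all hypothetical.

```python
from typing import List, Set, Tuple, Union

# Hypothetical tree encoding: a leaf is a single character (str),
# an internal node is a pair (left_subtree, right_subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def span_text(tree: Tree) -> str:
    """Return the character span covered by a (sub)tree."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    return span_text(left) + span_text(right)


def tokenize_top_down(tree: Tree, vocab: Set[str]) -> List[str]:
    """Walk the tree from the root; emit a node's span as one token
    as soon as it is in the vocabulary, otherwise descend into its children.
    Assumption: a leaf that is not in the vocabulary is emitted as-is."""
    text = span_text(tree)
    if text in vocab or isinstance(tree, str):
        return [text]
    left, right = tree
    return tokenize_top_down(left, vocab) + tokenize_top_down(right, vocab)


# Hand-written example tree for "unkind": [[u n] [[k i] [n d]]]
example_tree: Tree = (("u", "n"), (("k", "i"), ("n", "d")))
print(tokenize_top_down(example_tree, {"un", "kind"}))  # ['un', 'kind']
```

Because matching starts at the root, the largest in-vocabulary constituent on each path is always preferred, so a token boundary can only fall where the tree already has a constituent boundary.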

Taxonomy

Topics: Neural Networks and Applications

Methods: ALIGN · Byte Pair Encoding · WordPiece