Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization
Zilong Li

TL;DR
This paper explores integrating morphological segmentation into subword tokenization using statistical and transformer-based methods, showing that morphological approaches can match traditional subword tokenizers in effectiveness.
Contribution
It introduces the use of morphological segmentation methods within subword tokenizers and analyzes their impact on language model performance.
Findings
Morphological segmentation can be as effective as standard subword tokenizers.
A tokenizer with a balanced token frequency distribution tends to give better language model performance.
Keeping frequent words as unique tokens is one way to achieve a balanced token vocabulary.
Abstract
This paper presents the submission of team Ryu to the canceled SIGMORPHON 2024 shared task on subword tokenization. My submission explores whether morphological segmentation methods can be used as part of subword tokenizers. I adopt two approaches, each used inside a tokenizer: the statistical segmentation method Morfessor and a transformer-based sequence-to-sequence (seq2seq) segmentation model. The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers. Additionally, I investigate how a tokenizer's vocabulary influences the performance of language models. A tokenizer with a balanced token frequency distribution tends to work better. A balanced token vocabulary can be achieved by keeping frequent words as unique tokens.
Taxonomy
Topics: Natural Language Processing Techniques
