Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization

Zilong Li

arXiv:2410.17094·cs.CL·October 23, 2024

Open Access

TL;DR

This paper explores integrating morphological segmentation into subword tokenization using statistical and transformer-based methods, showing that morphological approaches can match traditional subword tokenizers in effectiveness.

Contribution

It introduces the use of morphological segmentation methods within subword tokenizers and analyzes their impact on language model performance.

Findings

1. Morphological segmentation can be as effective as commonly used subword tokenizers.

2. A tokenizer with a balanced token frequency distribution tends to yield better language model performance.

3. Keeping frequent words as unique tokens is one way to achieve a balanced token vocabulary.

Abstract

This paper presents the submission of Team Ryu to the canceled SIGMORPHON 2024 shared task on subword tokenization. My submission explores whether morphological segmentation methods can be used as part of subword tokenizers. I adopt two approaches, each used inside a tokenizer: the statistical segmentation method Morfessor and a transformer-based sequence-to-sequence (seq2seq) segmentation model. The prediction results show that morphological segmentation can be as effective as commonly used subword tokenizers. Additionally, I investigate how a tokenizer's vocabulary influences the performance of language models. A tokenizer with a balanced token frequency distribution tends to work better, and such a balanced vocabulary can be achieved by keeping frequent words as unique tokens.
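The idea of keeping frequent words as unique tokens while segmenting the rest can be illustrated with a minimal Python sketch. It is not the submission's implementation. The function names, the `top_k` cutoff, and the suffix-stripping `toy_segment` (a hypothetical stand-in for Morfessor or the seq2seq model) are all assumptions made for illustration. Normalized entropy is likewise an assumed proxy for how "balanced" a token frequency distribution is; the abstract does not state which measure the paper uses.

```python
import math
from collections import Counter


def build_whole_word_vocab(corpus_words, top_k):
    """Return the top_k most frequent words; these stay unsegmented."""
    counts = Counter(corpus_words)
    return {word for word, _ in counts.most_common(top_k)}


def toy_segment(word):
    """Hypothetical stand-in for a morphological segmenter.

    Splits off one of a few English suffixes; a real system would call
    Morfessor or a trained seq2seq model here instead.
    """
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return [word[: -len(suffix)], suffix]
    return [word]


def tokenize(word, whole_words, segment=toy_segment):
    """Keep frequent words as single tokens, segment everything else."""
    if word in whole_words:
        return [word]
    return segment(word)


def normalized_entropy(tokens):
    """Assumed balance proxy: entropy of token frequencies, scaled to [0, 1].

    1.0 means all token types are equally frequent; values near 0 mean
    a few types dominate.
    """
    counts = Counter(tokens)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


if __name__ == "__main__":
    corpus = "the cat walked and the dog walked while the birds were singing".split()
    whole_words = build_whole_word_vocab(corpus, top_k=2)
    tokens = [tok for word in corpus for tok in tokenize(word, whole_words)]
    print(tokens)
    print(round(normalized_entropy(tokens), 3))
```

Comparing `normalized_entropy` across different `top_k` values on a real corpus would give a rough sense of how the whole-word cutoff shifts the token distribution, under the assumption that entropy is a reasonable balance measure.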

Taxonomy

Topics: Natural Language Processing Techniques