Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

TL;DR
This paper introduces an evaluation framework for subword tokenization that combines an intrinsic classification of tokenizations as morphological or alien with extrinsic out-of-vocabulary (OOV) generalization tests, and finds that alien tokenization leads to poorer generalization.
Contribution
It proposes a combined intrinsic-extrinsic evaluation framework for subword tokenization, comprising a new tool (UniMorph Labeller) and a new benchmark (Out-of-Vocabulary Generalization Challenge 1.0) for assessing the effects of morphological versus alien tokenization.
Findings
UniMorph Labeller achieves 98% accuracy
Alien tokenization results in poorer OOV generalization
Morphological tokenization improves semantic compositionality
Abstract
The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa,…
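The morphological-versus-alien distinction can be illustrated with a minimal sketch. This is not the UniMorph Labeller algorithm, which the abstract does not describe. It assumes a simplified criterion: a tokenization counts as morphological when every subword boundary falls on a gold morpheme boundary, and as alien otherwise. The function names, the "unsplit" label, and the example segmentations are hypothetical.

```python
from itertools import accumulate


def boundaries(pieces):
    """Return the set of internal character offsets where adjacent pieces meet."""
    offsets = list(accumulate(len(p) for p in pieces))
    return set(offsets[:-1])  # drop the final offset (end of word)


def label_composition(subwords, morphemes):
    """Label a subword tokenization against a gold morpheme segmentation.

    Assumed criterion (not taken from the paper): "morphological" if every
    subword boundary is also a morpheme boundary, "alien" otherwise.
    "unsplit" is a hypothetical label for words kept as a single token.
    """
    if "".join(subwords) != "".join(morphemes):
        # Surface forms must concatenate to the same word; real morphology
        # often violates this (e.g. spelling changes), which this sketch ignores.
        raise ValueError("subwords and morphemes must spell the same word")
    if len(subwords) == 1:
        return "unsplit"
    if boundaries(subwords) <= boundaries(morphemes):
        return "morphological"
    return "alien"


if __name__ == "__main__":
    gold = ["un", "do", "able"]  # illustrative gold segmentation
    for tokens in (["undo", "able"], ["und", "oable"], ["undoable"]):
        print(tokens, label_composition(tokens, gold))
```

A boundary-subset test is only one possible reading of "respecting morpheme boundaries"; a stricter variant could require each subword to be exactly one morpheme.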
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: Attention Is All You Need · Weight Decay · Dense Connections · Residual Connection · Softmax · Adam · Linear Warmup With Linear Decay · Layer Normalization · Attention Dropout · Linear Layer
