Lost in Space Marking
Cassandra L. Jacobs, Yuval Pinter

TL;DR
This paper asks which subword token should carry the word-boundary mark in a tokenizer. For a Unigram LM tokenizer trained on pre-tokenized English, marking the word-initial token works better; for one trained on raw text, marking the word end works better. The results hold across domains.
Contribution
It compares word-initial and word-final marking in Unigram LM tokenizers and shows that the better choice depends on whether the tokenizer is trained on pre-tokenized or raw text.
Findings
A Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, judged by efficiency, cohesion, and morphological coverage.
A Unigram LM tokenizer trained on raw text benefits from marking word ends.
These findings generalize across domains.
Abstract
We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
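To make the two marking conventions concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's code: the helper `mark_pieces`, the example segmentation, and the choice of "▁" as the mark are all assumptions made for this example. It takes a word that has already been split into subword pieces and attaches the mark either to the first piece or to the last one.

```python
# Illustrative sketch only: the function name, the mark symbol, and the
# segmentation below are assumptions, not taken from the paper.
MARK = "\u2581"  # the "▁" character, used here as the boundary mark


def mark_pieces(pieces, scheme):
    """Attach a boundary mark to one word's subword pieces.

    scheme="initial" puts the mark in front of the first piece;
    scheme="final" puts it after the last piece.
    """
    if not pieces:
        return []
    marked = list(pieces)  # copy so the caller's list is not modified
    if scheme == "initial":
        marked[0] = MARK + marked[0]
    elif scheme == "final":
        marked[-1] = marked[-1] + MARK
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return marked


# Hypothetical segmentation of the two words "tokenizers work".
words = [["token", "izer", "s"], ["work"]]

initial = [p for w in words for p in mark_pieces(w, "initial")]
final = [p for w in words for p in mark_pieces(w, "final")]

print(initial)
print(final)
```

Under word-initial marking the pieces "token" and "work" carry the mark, so a word-internal "work" would be a different vocabulary item from a word-starting one; under word-final marking the distinction falls on "s" and "work" as word endings instead.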
Taxonomy
Topics: Natural Language Processing Techniques · Second Language Acquisition and Learning · Text Readability and Simplification
