Lost in Space Marking

Cassandra L. Jacobs; Yuval Pinter

arXiv:2208.01561·cs.CL·August 3, 2022

TL;DR

This paper examines whether a subword tokenizer should place its special mark on the word-initial token or on the word-final one. For a Unigram LM tokenizer, marking word-initial tokens works better on pre-tokenized English text, while marking word ends works better on raw text; the results hold across domains.

Contribution

It compares word-initial and word-final marking in Unigram LM tokenizers in terms of efficiency, cohesion, and morphological coverage, showing that the better choice depends on whether the training text is pre-tokenized or raw.

Findings

01

Marking word-initial tokens improves tokenizer performance on pre-tokenized text.

02

Marking word ends benefits tokenizers trained on raw text.

03

Findings are consistent across different domains.

Abstract

We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
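To make the two options concrete, here is a minimal Python sketch of word-initial versus word-final marking. It is an illustration only, not the paper's code: the marker symbols (`▁` as a prefix and `</w>` as a suffix), the helper names, and the hand-written segmentation are all assumptions chosen for the example, and no Unigram LM tokenizer is actually trained here.

```python
from typing import List

# Assumed marker symbols, chosen only for illustration.
INITIAL_MARK = "\u2581"  # "▁" prefixed to the first piece of each word
FINAL_MARK = "</w>"      # suffix attached to the last piece of each word


def mark_word_initial(pieces: List[str]) -> List[str]:
    """Attach the mark to the first subword piece of a word."""
    if not pieces:
        return []
    return [INITIAL_MARK + pieces[0]] + pieces[1:]


def mark_word_final(pieces: List[str]) -> List[str]:
    """Attach the mark to the last subword piece of a word."""
    if not pieces:
        return []
    return pieces[:-1] + [pieces[-1] + FINAL_MARK]


def detokenize(tokens: List[str], mark: str) -> str:
    """Rebuild text by turning each mark into a space between words."""
    return "".join(tokens).replace(mark, " ").strip()


# Hypothetical segmentation of the text "unmarked token" into pieces.
segmented = [["un", "mark", "ed"], ["token"]]

initial_tokens = [t for word in segmented for t in mark_word_initial(word)]
final_tokens = [t for word in segmented for t in mark_word_final(word)]

print(initial_tokens)
print(final_tokens)
print(detokenize(initial_tokens, INITIAL_MARK))
print(detokenize(final_tokens, FINAL_MARK))
```

Both conventions keep word boundaries recoverable. What changes is which piece of each word gets a marked vocabulary entry distinct from its unmarked form, and that is the decision the abstract evaluates.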

Taxonomy

Topics: Natural Language Processing Techniques · Second Language Acquisition and Learning · Text Readability and Simplification