Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Sabrina J. Mielke; Zaid Alyafeai; Elizabeth Salesky; Colin Raffel; Manan Dey; Matthias Gallé; Arun Raja; Chenglei Si; Wilson Y. Lee; Benoît Sagot; Samson Tan

arXiv:2112.10508·cs.CL·December 21, 2021·104 cites

Open Access

TL;DR

This paper reviews the evolution of text units in NLP, tracing the shift from word-based models to subword and byte-level approaches and arguing that tokenization remains important.

Contribution

It provides a comprehensive history and analysis of open-vocabulary modeling and tokenization techniques in NLP, connecting pre-neural and neural methods.

Findings

01. Subword approaches enable small vocabularies and fast inference.

02. Hybrid word-character models have been proposed and evaluated.

03. No single tokenization method is optimal for all applications.

Abstract

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating them as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural eras by showing how hybrid approaches of words and characters, as well as subword-based approaches based on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver-bullet solution for all applications and that thinking seriously…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics