Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, Samson Tan

TL;DR
This paper reviews the evolution of text units in NLP, highlighting the shift from word-based models to subword and byte-level approaches, emphasizing the ongoing importance of tokenization.
Contribution
It provides a comprehensive history and analysis of open-vocabulary modeling and tokenization techniques in NLP, connecting pre-neural and neural methods.
Findings
Subword approaches enable small vocabularies and fast inference.
Hybrid word-character models have been proposed and evaluated.
No single tokenization method is optimal for all applications.
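To make the first finding concrete, here is a minimal sketch of how byte-pair encoding (BPE) can learn a small subword vocabulary and then segment a word it never saw in training. It is an illustration under simplifying assumptions, not the implementation from any paper or library: the function names (`merge_pair`, `learn_bpe`, `segment`), the toy corpus, the lexicographic tie-breaking rule, and the omission of an end-of-word marker are all choices made here for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Assumption: ties are broken by lexicographic order so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption for illustration); "lowest" does not occur in it.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)
print(merges)                     # [('s', 't'), ('e', 'st'), ('o', 'w'), ('l', 'ow')]
print(segment("lowest", merges))  # ['low', 'est']
```

With only four merges, the unseen word "lowest" is covered by two learned subwords rather than an unknown-word token, which is the open-vocabulary behavior the survey attributes to subword approaches. Any word can still fall back to single characters, so the vocabulary stays small.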
Abstract
What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating them as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work from the pre-neural and neural eras by showing how hybrid approaches of words and characters, as well as subword-based approaches based on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver-bullet solution for all applications and that thinking seriously…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics
