Fast WordPiece Tokenization

Xinying Song; Alex Salcianu; Yang Song; Dave Dopson; Denny Zhou

arXiv:2012.15524 · cs.CL · October 7, 2021

TL;DR

This paper introduces a linear-time algorithm for WordPiece tokenization and reports that it runs 8.2x faster than HuggingFace Tokenizers and 5.1x faster than TensorFlow Text.

Contribution

The paper presents a new O(n) algorithm for WordPiece tokenization inspired by Aho-Corasick, enabling faster tokenization for single words and general text.

Findings

01. 8.2x faster than HuggingFace Tokenizers
02. 5.1x faster than TensorFlow Text
03. Effective for both single-word and sentence tokenization

Abstract

Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence) tokenization. When tokenizing a single word, WordPiece uses a longest-match-first strategy, known as maximum matching. The best known algorithms so far are O(n^2) (where n is the input length) or O(nm) (where m is the maximum vocabulary token length). We propose a novel algorithm whose tokenization complexity is strictly O(n). Our method is inspired by the Aho-Corasick algorithm. We introduce additional linkages on top of the trie built from the vocabulary, allowing smart transitions when the trie matching cannot continue. For general text, we further propose an algorithm that combines pre-tokenization (splitting the text into words) and our linear-time…
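To make the longest-match-first (maximum matching) strategy concrete, here is a minimal Python sketch of the straightforward baseline that the abstract describes as quadratic in the worst case. It is not the paper's linear-time algorithm. The function name `max_match_wordpiece` and the toy vocabulary are hypothetical; the `##` suffix indicator and `[UNK]` token follow the usual BERT convention.

```python
from typing import List, Set


def max_match_wordpiece(
    word: str,
    vocab: Set[str],
    unk: str = "[UNK]",   # assumed unknown-token string (BERT convention)
    suffix: str = "##",   # assumed marker for non-initial pieces (BERT convention)
) -> List[str]:
    """Baseline longest-match-first tokenization of a single word.

    At each position, try the longest remaining substring first and shrink
    it one character at a time until a vocabulary entry matches. This
    repeated shrinking is what makes the baseline O(n^2) in the worst case.
    """
    tokens: List[str] = []
    start = 0
    n = len(word)
    while start < n:
        end = n
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                # Pieces that do not begin the word carry the suffix marker.
                piece = suffix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary:
            # the whole word maps to the unknown token.
            return [unk]
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "a", "##a"}
print(max_match_wordpiece("unaffable", toy_vocab))
print(max_match_wordpiece("xyz", toy_vocab))
```

The inner loop restarts from the end of the word after every match, which is the rescanning that the proposed method avoids by adding extra linkages to the vocabulary trie.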

Code & Models

Repositories

tensorflow/text/blob/master/docs/api_docs/python/text/FastWordpieceTokenizer.md
tfOfficial

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Adam · Attention Dropout · Residual Connection · Weight Decay · Dropout · Dense Connections · Softmax · Linear Warmup With Linear Decay