Tokenization Falling Short: On Subword Robustness in Large Language Models

Yekun Chai; Yewei Fang; Qiwei Peng; Xuhong Li

arXiv:2406.11687·cs.CL·October 7, 2024

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper investigates the limitations of subword tokenization in large language models, showing that the models remain susceptible to typographical errors and other text variations, and explores subword regularization methods such as BPE-dropout as a way to improve robustness.

Contribution

The paper systematically analyzes tokenization issues in LLMs through three research questions (complex problem solving, token structure probing, and resilience to typographical variation) and evaluates mitigation strategies such as subword regularization.

Findings

01. Scaling models reduces tokenization biases

02. LLMs remain sensitive to typos and text variations

03. Subword regularization techniques can mitigate tokenization issues

Abstract

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens--issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout…
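To make the sensitivity to typos and the idea behind BPE-dropout concrete, here is a minimal sketch of a toy BPE encoder with optional merge dropout. It is an illustration only, not the paper's code: the merge table `MERGES` and the function `bpe_encode` are hypothetical, and the encoder applies one merge at a time rather than following any particular library's implementation. With dropout enabled, each candidate merge is skipped with probability `dropout` at every step, so the same word can be segmented in several different ways.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = applied first).
# Real vocabularies are learned from data; this one is made up for illustration.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
RANK = {pair: i for i, pair in enumerate(MERGES)}


def bpe_encode(word, dropout=0.0, rng=None):
    """Segment `word` with the toy merge table.

    With dropout > 0, every candidate merge is independently skipped with
    probability `dropout` at each step (the idea behind BPE-dropout).
    """
    rng = rng or random.Random()
    tokens = list(word)  # start from single characters
    while True:
        # Collect adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (RANK[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in RANK and not (dropout > 0 and rng.random() < dropout)
        ]
        if not candidates:
            return tokens
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


print(bpe_encode("lower"))               # ['lower']
print(bpe_encode("lowre"))               # ['low', 'r', 'e']  (a transposition typo)
print(bpe_encode("lower", dropout=1.0))  # ['l', 'o', 'w', 'e', 'r']
print(bpe_encode("lower", dropout=0.5, rng=random.Random(0)))  # one sampled segmentation
```

In this toy setting a single transposed character turns one token into three, which is the kind of segmentation shift the abstract describes. Sampling segmentations with dropout exposes a model to such alternative splits of the same surface string.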

Code & Models

Repositories

floatai/tkeval · Official

Datasets

floatai/TKEval
dataset · 16 downloads
