Problematic Tokens: Tokenizer Bias in Large Language Models
Jin Yang, Zhiqiang Wang, Yanbin Lin, Zunduo Zhao

TL;DR
This paper investigates how tokenization biases in large language models like GPT-4o contribute to disparities in language processing, highlighting security and ethical concerns stemming from inadequate token vocabularies for under-resourced languages.
Contribution
It analyzes the tokenization process of GPT-4o, revealing how token vocabulary construction leads to biases and proposing strategies to improve tokenization for fairness and security.
Findings
Tokenization biases affect non-English language performance.
Simplified token handling amplifies security and ethical risks.
Proposed solutions mitigate biases and improve tokenization robustness.
Abstract
Recent advancements in large language models (LLMs), such as GPT-4 and GPT-4o, have shown exceptional performance, especially in languages with abundant resources like English, thanks to extensive datasets that ensure robust training. Conversely, these models exhibit limitations when processing under-resourced languages such as Chinese and Korean, where issues including hallucinatory responses remain prevalent. This paper traces the roots of these disparities to the tokenization process inherent to these models. Specifically, it explores how the tokenizer's vocabulary, often used to speed up the tokenization process and reduce tokens but constructed independently of the actual model training data, inadequately represents non-English languages. This misrepresentation results in the propagation of under-trained or untrained tokens, which perpetuate biases and pose serious concerns related…
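The mechanism described in the abstract, a vocabulary built separately from the training data that leaves some tokens never seen in training, can be illustrated with a toy sketch. Everything here is hypothetical: the vocabulary, the corpus, and the helper names `greedy_tokenize` and `unused_tokens` are made up for illustration, and a greedy longest-match tokenizer stands in for the merge-based BPE that real tokenizers use. It is not the paper's method.

```python
from collections import Counter


def greedy_tokenize(text, vocab):
    """Split text by longest match against vocab (a simplification of BPE).

    Characters not covered by any vocab entry become single-character tokens.
    """
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def unused_tokens(vocab, training_corpus):
    """Return vocab entries that never occur when the corpus is tokenized."""
    counts = Counter()
    for document in training_corpus:
        counts.update(greedy_tokenize(document, vocab))
    return sorted(token for token in vocab if counts[token] == 0)


# Hypothetical vocabulary, assumed to come from a different corpus
# than the one the model is trained on.
VOCAB = {"the", "cat", "sat", " ", "你好", "世界", "안녕"}

# Hypothetical training corpus, dominated by English text.
TRAINING_CORPUS = ["the cat sat", "the cat", "你好"]

never_seen = unused_tokens(VOCAB, TRAINING_CORPUS)
print(never_seen)  # ['世界', '안녕']
```

In this toy setup the tokens `世界` and `안녕` exist in the vocabulary but receive no training signal, which is the sense in which a token is "untrained". A model could still be handed such a token at inference time.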
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention
