Problematic Tokens: Tokenizer Bias in Large Language Models

Jin Yang; Zhiqiang Wang; Yanbin Lin; Zunduo Zhao

arXiv:2406.11214·cs.CL·November 15, 2024

TL;DR

This paper investigates how tokenization biases in large language models like GPT-4o contribute to disparities in language processing, highlighting security and ethical concerns stemming from inadequate token vocabularies for under-resourced languages.

Contribution

It analyzes the tokenization process of GPT-4o, revealing how token vocabulary construction leads to biases and proposing strategies to improve tokenization for fairness and security.

Findings

01 Tokenization biases affect non-English language performance.

02 Simplified token handling amplifies security and ethical risks.

03 Proposed solutions mitigate biases and improve tokenization robustness.

Abstract

Recent advancements in large language models (LLMs), such as GPT-4 and GPT-4o, have shown exceptional performance, especially in languages with abundant resources like English, thanks to extensive datasets that ensure robust training. Conversely, these models exhibit limitations when processing under-resourced languages such as Chinese and Korean, where issues including hallucinatory responses remain prevalent. This paper traces the roots of these disparities to the tokenization process inherent to these models. Specifically, it explores how the tokenizer's vocabulary, often used to speed up the tokenization process and reduce tokens but constructed independently of the actual model training data, inadequately represents non-English languages. This misrepresentation results in the propagation of under-trained or untrained tokens, which perpetuate biases and pose serious concerns related…
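The mechanism the abstract describes, a vocabulary built separately from the training data and therefore containing tokens the model rarely or never sees, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the vocabulary, the corpus, the `min_count` threshold, and the helper names `greedy_tokenize` and `find_undertrained` are hypothetical, and greedy longest-match tokenization stands in for the byte pair encoding that GPT-4o actually uses. It is not the paper's method.

```python
from collections import Counter


def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer (a stand-in for BPE, not the real algorithm).

    Characters not covered by the vocabulary fall back to single-character tokens.
    """
    max_len = max(len(token) for token in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def find_undertrained(vocab, training_texts, min_count=2):
    """Split vocabulary entries by how often they occur in the training texts.

    Returns (untrained, undertrained): tokens seen zero times, and tokens seen
    at least once but fewer than min_count times (an arbitrary threshold).
    """
    counts = Counter()
    for text in training_texts:
        counts.update(greedy_tokenize(text, vocab))
    untrained = sorted(t for t in vocab if counts[t] == 0)
    undertrained = sorted(t for t in vocab if 0 < counts[t] < min_count)
    return untrained, undertrained


# Hypothetical vocabulary fixed in advance, and a hypothetical English-heavy corpus.
vocab = {"the", " ", "cat", "sat", "你好", "世界", "안녕"}
training_texts = ["the cat sat", "the cat", "你好"]

untrained, undertrained = find_undertrained(vocab, training_texts)
print("untrained:", untrained)
print("under-trained:", undertrained)
```

In this toy setup the non-English vocabulary entries are the ones most likely to end up with zero or near-zero counts, which mirrors the disparity the abstract attributes to vocabularies constructed independently of the training data.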

Code & Models

Repositories

yeyimilk/llmgpt4o · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention