SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Kevin Slagle

arXiv:2404.14408 · cs.CL · October 8, 2024 · 1 citation

TL;DR

SpaceByte is a byte-level language model that inserts larger transformer blocks after certain bytes, such as spaces, to avoid the disadvantages of tokenization while matching the performance of tokenized models.

Contribution

The paper proposes a byte-level decoder architecture that applies larger transformer blocks only after certain bytes, such as space characters, avoiding the disadvantages of tokenization without sacrificing performance.

Findings

01. SpaceByte outperforms other byte-level models.

02. SpaceByte matches the performance of tokenized Transformer models.

03. The architecture reduces tokenization-related disadvantages.

Abstract

Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model, but with extra larger transformer blocks inserted in the middle of the layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference…
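
The placement rule described in the abstract (larger blocks applied only after certain bytes, such as spaces) can be illustrated with a minimal Python sketch. This is not the author's implementation: the function name `global_block_positions` and the `boundary_bytes` parameter are hypothetical, and the rule is simplified to treat only the ASCII space as a boundary byte.

```python
def global_block_positions(data: bytes, boundary_bytes: bytes = b" ") -> list[int]:
    """Return the indices of bytes after which a larger block would be applied.

    Assumption for illustration: a byte is a boundary if it appears in
    `boundary_bytes` (only the ASCII space by default).
    """
    # Iterating over a bytes object yields integers, and `int in bytes`
    # tests whether that byte value occurs in `boundary_bytes`.
    return [i for i, b in enumerate(data) if b in boundary_bytes]


text = "the cat sat".encode("utf-8")
positions = global_block_positions(text)
print(positions)  # [3, 7]

# Share of byte positions that would receive the larger blocks.
print(len(positions) / len(text))
```

Under this simplified rule, only a small share of positions trigger the larger blocks, while every byte still passes through the smaller byte-level layers.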

Code & Models

Repositories

kjslag/spacebyte (PyTorch, official)

Videos

SpaceByte: Towards Deleting Tokenization from Large Language Modeling · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax · Adam