SpaceByte: Towards Deleting Tokenization from Large Language Modeling
Kevin Slagle

TL;DR
SpaceByte is a byte-level language model that inserts larger Transformer blocks after certain bytes, such as spaces, to avoid the disadvantages of tokenization while matching the performance of tokenized models.
Contribution
It proposes a novel byte-level decoder architecture that applies extra, larger Transformer blocks only after certain bytes, such as space characters, reducing tokenization-related issues without sacrificing performance.
Findings
SpaceByte outperforms other byte-level models.
It matches the performance of tokenized Transformer models.
The architecture reduces tokenization-related disadvantages.
Abstract
Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model, but with extra larger transformer blocks inserted in the middle of the layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference…
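The idea of applying larger blocks only after certain bytes can be illustrated with a minimal sketch. It assumes the simplest possible rule, namely that a larger block runs at every ASCII space byte (0x20). The function name `global_block_positions` and the `boundary_bytes` parameter are hypothetical and are not taken from the paper, whose exact selection rule is not given in the text above.

```python
def global_block_positions(
    text: str,
    boundary_bytes: frozenset[int] = frozenset({0x20}),  # assumption: ASCII space only
) -> list[int]:
    """Return the byte offsets at which a larger block would be applied.

    Illustrative only: the rule "every byte in boundary_bytes" is an
    assumption, not the paper's exact criterion.
    """
    data = text.encode("utf-8")  # the model operates on raw UTF-8 bytes
    return [i for i, b in enumerate(data) if b in boundary_bytes]


positions = global_block_positions("the cat sat")
print(positions)  # [3, 7]
```

Under this assumed rule, an 11-byte input triggers the larger blocks at only two positions, which suggests why restricting them to word boundaries could keep compute low relative to running them at every byte.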
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax · Adam
