A Family of LLMs Liberated from Static Vocabularies
Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren

TL;DR
This paper presents a family of large language models with up to 70 billion parameters built on the hierarchical autoregressive transformer (HAT) architecture, which replaces a static vocabulary with byte-level processing. The authors report improved adaptability, compression, and robustness, and demonstrate the approach through extensive pre-training and fine-tuning.
Contribution
The authors apply the HAT architecture at scale and show that available pre-trained models, here Llama 3.1 8B and 70B, can be converted to it, replacing a fixed vocabulary with byte-level encoding.
Findings
HAT models outperform original Llama models on most benchmarks.
Byte-level processing improves text compression and robustness.
Pre-trained and fine-tuned HAT models show strong multilingual proficiency.
Abstract
Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and…
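The data flow described in the abstract can be illustrated with a minimal Python sketch. It contains no model: it only shows how text might be cut into word-level chunks of UTF-8 bytes and how many positions each stage would then process. The whitespace-based splitting rule and the names split_into_word_chunks and hat_sequence_lengths are assumptions made for illustration; the excerpt above does not state the paper's actual splitting rule.

```python
import re


def split_into_word_chunks(text: str) -> list[bytes]:
    """Split text into word-level chunks of UTF-8 bytes.

    Assumption (not from the paper): a chunk is a run of non-whitespace
    characters together with any whitespace that follows it. Leading
    whitespace forms its own chunk, so no characters are dropped.
    """
    pattern = re.compile(r"\S+\s*|\s+")
    return [match.group(0).encode("utf-8") for match in pattern.finditer(text)]


def hat_sequence_lengths(text: str) -> dict[str, int]:
    """Count the positions each stage would see under the assumed split.

    The encoder and decoder work on individual bytes, while the backbone
    receives one embedding per word-level chunk.
    """
    chunks = split_into_word_chunks(text)
    return {
        "encoder_byte_positions": sum(len(chunk) for chunk in chunks),
        "backbone_word_positions": len(chunks),
    }


if __name__ == "__main__":
    sample = "Tokenization converts raw text into processable units."
    for chunk in split_into_word_chunks(sample):
        # Each chunk is a plain byte string; no vocabulary lookup is involved.
        print(list(chunk))
    print(hat_sequence_lengths(sample))
```

Under this assumed rule, the byte chunks concatenate back to the original UTF-8 text, so nothing depends on a learned vocabulary, and the backbone sequence is shorter than the byte sequence by roughly the average chunk length.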
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
