A Family of LLMs Liberated from Static Vocabularies

Aleph Alpha: Adnen Abdessaied; Artur Baranowski; Lukas Balles; Michael Barlow; Fabien C. Y. Benureau; Felix Berkenkamp; Lukas Bluebaum; Bastian Boll; Thomas F. Burns; Björn Deiseroth; Constantin Eichenberg; David Friede; Pablo Iyu Guerrero; Ahmed Hammam; Bastian Harren; Johann Higl; Yasser Jadidi; Carina Kauf; Johannes Messner; Jan Hendrik Metzen; Max Meuer; Vedant Nanda; Pit Neitemeier; Koen Oostermeijer; Letitia Parcalabescu; Markus Pernpointner; Felix Reinfurt; Dylan Rodriquez; Grégory Schott; Philipp Siedler; Martin Simonovsky; Till Speicher; Volker Stampa; Stephan Wäldchen; Samuel Weinbach; Gregor Ziegltrum

arXiv:2603.15953·cs.CL·March 18, 2026

TL;DR

This paper introduces a new hierarchical autoregressive transformer architecture that replaces static vocabularies with byte-level processing, improving adaptability, compression, and robustness in large language models, and demonstrates its effectiveness through extensive pre-training and fine-tuning.

Contribution

The authors propose the HAT architecture, enabling reuse of pre-trained models and replacing fixed vocabularies with byte-level encoding, enhancing flexibility and performance in LLMs.

Findings

1. HAT models outperform original Llama models on most benchmarks.
2. Byte-level processing improves text compression and robustness.
3. Pre-trained and fine-tuned HAT models show strong multilingual proficiency.

Abstract

Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and…
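
The data flow described in the abstract (bytes grouped into words, one embedding per word, then a causal backbone over words) can be sketched in a few lines. This is only a toy illustration, not the authors' implementation: the splitting rule, the names `split_words`, `byte_embedding`, `encode_word`, and `backbone`, the width `DIM`, and the use of mean pooling and a running mean in place of transformers are all assumptions made for the example.

```python
import re

DIM = 4  # toy embedding width (assumption, not from the paper)


def split_words(text: str) -> list[bytes]:
    # Hypothetical splitting rule: a run of non-space characters plus any
    # trailing whitespace forms one word-level chunk of UTF-8 bytes.
    return [m.group(0).encode("utf-8") for m in re.finditer(r"\S+\s*|\s+", text)]


def byte_embedding(b: int) -> list[float]:
    # Deterministic toy embedding for a byte value in 0..255.
    # A real model would learn a 256-row embedding table instead.
    return [((b * (i + 1)) % 256) / 255.0 for i in range(DIM)]


def encode_word(word: bytes) -> list[float]:
    # Stand-in for the encoder: mean-pool the byte embeddings so that each
    # word, whatever its byte length, becomes a single vector.
    vecs = [byte_embedding(b) for b in word]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]


def backbone(word_vecs: list[list[float]]) -> list[list[float]]:
    # Stand-in for the autoregressive backbone: the output at position t
    # depends only on words 0..t (here, a running mean).
    out, running = [], [0.0] * DIM
    for t, vec in enumerate(word_vecs, start=1):
        running = [r + x for r, x in zip(running, vec)]
        out.append([r / t for r in running])
    return out


text = "Tokenizer-free models read bytes."
words = split_words(text)                      # word-level byte chunks
word_vecs = [encode_word(w) for w in words]    # one vector per word
contexts = backbone(word_vecs)                 # one causal context per word
# A decoder would cross-attend to `contexts` and emit bytes; omitted here.
```

Because the input is raw UTF-8, no vocabulary lookup is needed: `split_words("Björn")` yields a single six-byte chunk, and any other script or symbol is handled the same way.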

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications