Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models
Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, Lukas Balles

TL;DR
This paper introduces a hierarchical autoregressive transformer architecture that combines character- and word-level processing, resulting in more robust and adaptable language models that outperform traditional subword-based models in various settings.
Contribution
The authors propose a novel hierarchical transformer model that integrates character-level and word-level processing, improving robustness and training efficiency without relying on fixed vocabularies.
Findings
Matches subword models in downstream tasks
Significantly more robust to input perturbations
Nearly doubles training speed during out-of-domain pretraining
Abstract
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large vocabularies, limited adaptability to new domains or languages, and sensitivity to spelling errors and variations. To overcome these limitations, we investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing. It employs a lightweight character-level encoder to convert character sequences into word embeddings, which are then processed by a word-level backbone model and decoded back into characters via a compact character-level decoder. This method retains the sequence compression benefits of word-level tokenization without relying on a rigid, predefined vocabulary. We demonstrate,…
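The data flow described in the abstract (characters grouped into words, one backbone position per word) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace-based splitting rule, the use of UTF-8 bytes as the character-level units, and the function names `split_into_words` and `to_byte_sequences` are all assumptions made for illustration. It only shows how word-level grouping shortens the sequence the backbone has to process without any fixed vocabulary.

```python
import re

def split_into_words(text: str) -> list[str]:
    # Hypothetical splitting rule (an assumption, not taken from the paper):
    # each chunk is a run of non-whitespace plus its trailing whitespace.
    # A leading run of whitespace becomes its own chunk, so nothing is lost.
    return re.findall(r"\S+\s*|\s+", text)

def to_byte_sequences(words: list[str]) -> list[list[int]]:
    # Assumed character-level units: raw UTF-8 byte values (0..255).
    # Each inner list is what a character-level encoder would consume
    # to produce a single word embedding.
    return [list(word.encode("utf-8")) for word in words]

text = "Tokenization is a fundamental step"
words = split_into_words(text)
byte_seqs = to_byte_sequences(words)

# The backbone would see one position per word; the character-level
# encoder and decoder would see the bytes inside each word.
backbone_len = len(words)
num_bytes = sum(len(seq) for seq in byte_seqs)
print(f"{num_bytes} bytes -> {backbone_len} backbone positions")

# The grouping is lossless: decoding the bytes restores the input text.
restored = b"".join(bytes(seq) for seq in byte_seqs).decode("utf-8")
```

Because the units are raw bytes rather than vocabulary entries, a misspelled or previously unseen word still maps to a valid input; it only changes the bytes inside one word group.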
Code & Models
- 🤗 Aleph-Alpha/llama-3_1-8b-tfree-hat-base (model) · 21 dl · ♡ 22
- 🤗 Aleph-Alpha/llama-3_1-8b-tfree-hat-sft (model) · 6 dl · ♡ 12
- 🤗 Aleph-Alpha/llama-3_1-8b-tfree-hat-dpo (model) · 19 dl · ♡ 15
- 🤗 Aleph-Alpha/llama-tfree-hat-pretrained-7b-dpo (model) · 64 dl · ♡ 10
- 🤗 Aleph-Alpha/tfree-hat-pretrained-7b-base (model) · 1.5k dl · ♡ 16
- 🤗 Aleph-Alpha/llama-3_1-70b-tfree-hat-sft (model) · 40 dl · ♡ 1
Taxonomy
Topics: Topic Modeling
