Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models

Pit Neitemeier; Björn Deiseroth; Constantin Eichenberg; Lukas Balles

arXiv:2501.10322·cs.CL·January 22, 2025

TL;DR

This paper introduces a hierarchical autoregressive transformer architecture that combines character- and word-level processing. The resulting language models match subword-based models on downstream tasks, are more robust to input perturbations, and train faster during out-of-domain pretraining.

Contribution

The authors propose a novel hierarchical transformer model that integrates character-level and word-level processing, improving robustness and training efficiency without relying on fixed vocabularies.

Findings

1. Matches subword models in downstream tasks
2. Significantly more robust to input perturbations
3. Nearly doubles training speed during out-of-domain pretraining

Abstract

Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large vocabularies, limited adaptability to new domains or languages, and sensitivity to spelling errors and variations. To overcome these limitations, we investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing. It employs a lightweight character-level encoder to convert character sequences into word embeddings, which are then processed by a word-level backbone model and decoded back into characters via a compact character-level decoder. This method retains the sequence compression benefits of word-level tokenization without relying on a rigid, predefined vocabulary. We demonstrate,…
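
The data flow described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the whitespace-based splitting rule, the function names, and the `toy_encode` stand-in for the character-level encoder are all assumptions made for illustration. The sketch only shows why the word-level backbone sees a much shorter sequence than a pure byte-level model would, and why no fixed vocabulary is needed (a misspelled word is still just a byte sequence).

```python
import re


def split_into_words(text: str) -> list[str]:
    """Split text into word chunks, attaching leading whitespace to each word.

    Assumption: this splitting rule is hypothetical and chosen so that the
    chunks concatenate back to the original text; the paper may use another rule.
    """
    return re.findall(r"\s*\S+|\s+$", text)


def to_byte_sequences(text: str) -> list[list[int]]:
    """Represent each word chunk as its UTF-8 bytes (no learned vocabulary)."""
    return [list(chunk.encode("utf-8")) for chunk in split_into_words(text)]


def toy_encode(byte_seq: list[int], dim: int = 4) -> list[float]:
    """Hypothetical stand-in for the character-level encoder.

    Maps a variable-length byte sequence to one fixed-size vector. A real
    encoder would be a small transformer; this only fixes the shapes.
    """
    vec = [0.0] * dim
    for i, b in enumerate(byte_seq):
        vec[i % dim] += b / 255.0
    return [v / len(byte_seq) for v in vec]


# A misspelled word ("fundamntal") needs no special handling.
text = "Tokenization is a fundamntal step"
byte_seqs = to_byte_sequences(text)

# One embedding per word: this is the sequence the backbone would process.
embeddings = [toy_encode(seq) for seq in byte_seqs]

n_bytes = sum(len(seq) for seq in byte_seqs)
print(f"byte-level length: {n_bytes}, word-level length: {len(embeddings)}")
```

In this sketch the backbone input has one position per word, while the byte-level encoder and decoder only ever operate within a single word, which is where the sequence compression mentioned in the abstract comes from.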

Taxonomy

Topics: Topic Modeling