From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Mathurin Videau; Badr Youbi Idrissi; Alessandro Leite; Marc Schoenauer; Olivier Teytaud; David Lopez-Paz

arXiv:2506.14761 · cs.CL · June 18, 2025

TL;DR

This paper introduces an autoregressive U-Net that learns to embed its own tokens from raw bytes, enabling flexible, multi-scale language modeling that adapts to different tasks and languages.

Contribution

It proposes a novel U-Net based architecture that learns tokenization during training, moving beyond fixed schemes like BPE, and demonstrates its effectiveness across various language modeling tasks.

Findings

01 Shallow hierarchies match strong BPE baselines.

02 Deeper hierarchies show promising trends in performance.

03 Model handles character-level tasks and low-resource languages.

Abstract

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future -- anticipating the next few words rather than the next byte -- so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising…
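The multi-scale pooling described in the abstract (bytes, then words, then pairs of words, then groups of up to 4 words) can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not state the splitting rule, so the sketch assumes whitespace-delimited words and assumes each word is represented by the position of its last byte. The names `word_end_positions` and `stage_positions` are hypothetical.

```python
WHITESPACE = b" \t\n"


def word_end_positions(data: bytes) -> list[int]:
    """Return the index of the last byte of each whitespace-delimited word.

    Assumption: words are split on whitespace; the paper's actual
    splitting function may differ.
    """
    ends = []
    for i, byte in enumerate(data):
        at_end = i + 1 == len(data)
        next_is_space = at_end or data[i + 1] in WHITESPACE
        if byte not in WHITESPACE and next_is_space:
            ends.append(i)
    return ends


def stage_positions(data: bytes, group_sizes=(1, 2, 4)) -> list[list[int]]:
    """For each stage, keep one pooling position per group of words.

    group_sizes=(1, 2, 4) mirrors the abstract: words, pairs of words,
    then groups of 4 words. Each stage keeps the byte index that closes
    every g-th word, so deeper stages see a shorter, coarser sequence.
    """
    ends = word_end_positions(data)
    return [ends[g - 1 :: g] for g in group_sizes]


if __name__ == "__main__":
    text = b"the cat sat on the mat"
    for g, positions in zip((1, 2, 4), stage_positions(text)):
        print(f"group size {g}: pool at byte indices {positions}")
```

For the 6-word example above, the word stage keeps 6 positions, the 2-word stage keeps 3, and the 4-word stage keeps 1. That shrinking sequence length is why, under these assumptions, predicting the next position at a deeper stage means anticipating several words rather than the next byte.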

Code & Models

Repositories

facebookresearch/lingua (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Byte Pair Encoding · Focus