Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling

Eric Egli; Matteo Manica; Jannis Born

arXiv:2502.14553·cs.CL·February 21, 2025

TL;DR

This paper introduces the Multiscale Byte Language Model (MBLM), a hierarchical architecture capable of efficiently modeling extremely long byte sequences for multimodal tasks, including visual question answering, with near-linear generation efficiency.

Contribution

The paper presents a model-agnostic hierarchical byte language model that handles multimodal data and extremely long sequences efficiently and, to the authors' knowledge, reports the first evaluation of byte language models on visual Q&A tasks.

Findings

01

MBLM can be trained with context windows of 5 million bytes on a single GPU in full model precision.

02

Hybrid architectures handle extremely long byte sequences efficiently during training and achieve near-linear efficiency in generation.

03

MBLM matches specialized CNN-LSTM models on visual Q&A tasks.

Abstract

Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of 5M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q&A tasks and find that, despite serializing…
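
The abstract does not spell out how the hierarchical decoder stack divides a bytestream, so the following is only a minimal sketch, assuming a two-stage hierarchy with fixed-size patches. The helper names (`pad_to_multiple`, `to_patches`, `stage_lengths`), the zero padding byte, and the patch sizes are hypothetical illustrations, not taken from the paper or its repository. The sketch shows why a hierarchy may help with long inputs: each stage attends over far fewer positions than the raw byte count.

```python
import math


def pad_to_multiple(data: bytes, multiple: int, pad_byte: int = 0) -> bytes:
    """Right-pad `data` so its length is a multiple of `multiple`."""
    remainder = len(data) % multiple
    if remainder == 0:
        return data
    return data + bytes([pad_byte]) * (multiple - remainder)


def to_patches(data: bytes, patch_size: int) -> list[bytes]:
    """Split a byte string into consecutive fixed-size patches, padding the tail."""
    padded = pad_to_multiple(data, patch_size)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def stage_lengths(num_bytes: int, patch_size: int) -> tuple[int, int]:
    """Positions seen by each stage of an assumed two-stage stack.

    The (assumed) global stage sees one position per patch; the (assumed)
    local stage sees only the bytes inside a single patch.
    """
    return math.ceil(num_bytes / patch_size), patch_size


# Toy input: 22 bytes become 3 patches of 8 bytes, the last one zero-padded.
patches = to_patches(b"hello multiscale bytes", patch_size=8)
print(patches)

# Hypothetical patch size of 1000 for a 5M-byte input.
print(stage_lengths(5_000_000, patch_size=1000))  # (5000, 1000)
```

Under these assumed numbers, neither stage processes more than 5000 positions at once, although the input holds 5 million bytes. The stage count and patch sizes actually used by MBLM should be taken from the paper or the linked repository.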

Code & Models

Repositories

ai4sd/multiscale-byte-lm
PyTorch · Official

Taxonomy

Topics: Semantic Web and Ontologies · Scientific Computing and Data Management · Biomedical Text Mining and Ontologies

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Label Smoothing · Multi-Head Attention · Mamba: Linear-Time Sequence Modeling with Selective State Spaces