Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling
Eric Egli, Matteo Manica, Jannis Born

TL;DR
This paper introduces the Multiscale Byte Language Model (MBLM), a hierarchical architecture capable of efficiently modeling extremely long byte sequences for multimodal tasks, including visual question answering, with near-linear generation efficiency.
Contribution
The paper presents a hierarchical byte language model that handles multimodal data and extremely long sequences efficiently. To the authors' knowledge, it also reports the first evaluation of byte language models on visual Q&A tasks, where the model performs competitively.
Findings
MBLM can be trained with context windows of 5 million bytes on a single GPU in full model precision.
It achieves near-linear efficiency in sequence generation.
MBLM matches specialized CNN-LSTM models on visual Q&A tasks.
Abstract
Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of 5M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q&A tasks and find that, despite serializing…
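The abstract does not say how a byte stream is divided among the stages of the hierarchical decoder stack. As a rough illustration only, the sketch below assumes a nested fixed-size patching scheme: the stream is padded and reshaped so that an outer stage sees a short sequence of patches while inner stages see the bytes within each patch. The function names, the padding id, and the patch sizes are all hypothetical and are not taken from the paper.

```python
from math import ceil, prod

PAD = 256  # hypothetical padding id, chosen outside the byte range 0..255


def to_patches(data: bytes, patch_sizes: tuple) -> list:
    """Reshape a flat byte stream into nested fixed-size patches.

    patch_sizes gives the inner patch lengths, outermost first. The number
    of outermost patches is derived from the (padded) input length.
    """
    ids = list(data)
    block = prod(patch_sizes)            # bytes covered by one outermost patch
    ids += [PAD] * (-len(ids) % block)   # pad up to a multiple of block

    def split(seq, sizes):
        if not sizes:
            return seq                   # innermost level: raw byte ids
        step = prod(sizes)               # elements per child at this level
        return [split(seq[i:i + step], sizes[1:])
                for i in range(0, len(seq), step)]

    return split(ids, patch_sizes)


def stage_lengths(n_bytes: int, patch_sizes: tuple) -> list:
    """Sequence length each stage would see under this assumed scheme."""
    return [ceil(n_bytes / prod(patch_sizes)), *patch_sizes]


nested = to_patches(b"hello world!", (2, 2))
print(len(nested), len(nested[0]), len(nested[0][0]))
print(stage_lengths(5_000_000, (100, 100)))
```

Under these assumed patch sizes, no single stage has to process more than a few hundred positions at once even though the input holds 5 million bytes, which is the general intuition behind hierarchical byte models. The sizes a real model uses may differ substantially.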
Taxonomy
Topics: Semantic Web and Ontologies · Scientific Computing and Data Management · Biomedical Text Mining and Ontologies
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Label Smoothing · Multi-Head Attention · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
