Jamba: A Hybrid Transformer-Mamba Language Model
Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen

TL;DR
Jamba introduces a hybrid Transformer-Mamba mixture-of-experts architecture that combines the strengths of both models, achieving high performance, scalability, and efficiency for large language modeling tasks with long-context capabilities.
Contribution
The paper presents a novel hybrid Transformer-Mamba MoE architecture that enhances model capacity and efficiency, with extensive analysis and publicly available checkpoints.
Findings
High throughput and small memory footprint compared to vanilla Transformers
State-of-the-art performance on standard benchmarks
Effective handling of 256K token context length
Abstract
We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various…
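The interleaving described in the abstract can be pictured as a per-layer schedule: each layer picks a sequence mixer (attention or Mamba) and a feed-forward type (dense MLP or MoE). The sketch below is illustrative only. The function names, the layer count, and the `attn_every` / `moe_every` ratios are assumptions chosen for the example, not values stated in the text above. The second helper relies on one general fact about such hybrids: only attention layers keep a key-value cache that grows with context length, so replacing attention layers with Mamba layers shrinks that cache proportionally.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerSpec:
    mixer: str  # "attention" or "mamba"
    mlp: str    # "moe" or "dense"


def hybrid_layout(n_layers: int, attn_every: int, moe_every: int,
                  attn_offset: int = 0, moe_offset: int = 1) -> List[LayerSpec]:
    """Build a hypothetical interleaved schedule.

    One layer in every `attn_every` uses attention, the rest use Mamba.
    One layer in every `moe_every` uses an MoE feed-forward, the rest are dense.
    All ratios are caller-supplied assumptions.
    """
    if attn_every < 1 or moe_every < 1:
        raise ValueError("attn_every and moe_every must be >= 1")
    layers = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_every == attn_offset % attn_every else "mamba"
        mlp = "moe" if i % moe_every == moe_offset % moe_every else "dense"
        layers.append(LayerSpec(mixer, mlp))
    return layers


def kv_cache_bytes(layout: List[LayerSpec], context_len: int,
                   n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size for one sequence.

    Only attention layers contribute; the factor 2 covers keys and values.
    Head counts and dimensions are placeholders, not the model's real sizes.
    """
    n_attn = sum(1 for spec in layout if spec.mixer == "attention")
    return 2 * n_attn * context_len * n_kv_heads * head_dim * bytes_per_value


# Example: an 8-layer block with 1 attention layer and MoE on every other layer.
hybrid = hybrid_layout(n_layers=8, attn_every=8, moe_every=2)
all_attention = hybrid_layout(n_layers=8, attn_every=1, moe_every=2)

print([f"{spec.mixer}+{spec.mlp}" for spec in hybrid])
print("KV cache ratio (all-attention / hybrid):",
      kv_cache_bytes(all_attention, 256_000, 8, 128)
      // kv_cache_bytes(hybrid, 256_000, 8, 128))
```

Under these assumed settings the cache ratio equals the ratio of attention-layer counts, which is why the choice of how many attention layers to keep is a direct memory lever at long context lengths.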
Taxonomy
Topics: Language, Discourse, Communication Strategies · Multilingual Education and Policy
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Softmax · Dense Connections · Label Smoothing · Adam · Absolute Position Encodings
