Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber; Barak Lenz; Hofit Bata; Gal Cohen; Jhonathan Osin; Itay Dalmedigos; Erez Safahi; Shaked Meirom; Yonatan Belinkov; Shai Shalev-Shwartz; Omri Abend; Raz Alon; Tomer Asida; Amir Bergman; Roman Glozman; Michael Gokhman; Avashalom Manevich; Nir Ratner; Noam Rozen; Erez Shwartz; Mor Zusman; Yoav Shoham

arXiv:2403.19887·cs.CL·July 4, 2024·41 cites

TL;DR

Jamba is a hybrid Transformer-Mamba mixture-of-experts (MoE) language model that interleaves Transformer and Mamba layers to combine the strengths of both model families. The paper reports high throughput, a small memory footprint relative to vanilla Transformers, and strong results at context lengths of up to 256K tokens.

Contribution

The paper presents a novel hybrid Transformer-Mamba MoE architecture that enhances model capacity and efficiency, with extensive analysis and publicly available checkpoints.

Findings

01. High throughput and a small memory footprint compared to vanilla Transformers
02. State-of-the-art performance on standard language model benchmarks
03. Strong results at context lengths of up to 256K tokens
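
The memory-footprint finding can be made concrete with a rough key-value (KV) cache estimate. This is a minimal sketch, not the paper's measurement: the layer counts, head count, head dimension, and 16-bit precision below are hypothetical values chosen for illustration, and `kv_cache_bytes` is a helper invented for this example. It relies on one general fact: only attention layers keep a cache that grows with context length, so replacing most attention layers with Mamba layers shrinks that cache proportionally. The fixed-size state kept by Mamba layers is ignored here.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV cache size in bytes for a single sequence.

    All arguments are hypothetical model dimensions for this sketch.
    Keys and values (the factor 2) are stored per attention layer,
    per KV head, per token.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 256 * 1024  # a 256K-token context
GIB = 2**30

# Hypothetical 32-layer model where every layer is attention.
all_attention = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

# Same hypothetical model with only 4 of the 32 layers being attention.
hybrid = kv_cache_bytes(n_attn_layers=4, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"all-attention: {all_attention / GIB:.0f} GiB")  # all-attention: 32 GiB
print(f"hybrid:        {hybrid / GIB:.0f} GiB")         # hybrid:        4 GiB
```

Under these assumed dimensions the cache shrinks by the same factor as the number of attention layers (8x here). Actual savings depend on the real model configuration, which this page does not list.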

Abstract

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various…
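
The interleaving described in the abstract can be sketched as a per-layer schedule. This is an illustrative sketch under stated assumptions, not the authors' implementation: `build_layer_schedule`, `attn_every`, and `moe_every` are hypothetical names, and the values passed in are example settings, since this page does not state the ratios used in the released model. Each layer pairs a sequence mixer (attention or Mamba) with an MLP (dense or MoE).

```python
from collections import Counter


def build_layer_schedule(n_layers, attn_every, moe_every):
    """Return one (mixer, mlp) label pair per layer.

    Hypothetical knobs for this sketch:
      attn_every: one attention layer per `attn_every` layers; the rest are Mamba.
      moe_every:  an MoE MLP replaces the dense MLP every `moe_every` layers.
    """
    if min(n_layers, attn_every, moe_every) < 1:
        raise ValueError("all arguments must be positive integers")
    schedule = []
    for i in range(n_layers):
        # Put the attention layer in the middle of each group of `attn_every` layers.
        mixer = "attention" if i % attn_every == attn_every // 2 else "mamba"
        # Put the MoE MLP on the last layer of each group of `moe_every` layers.
        mlp = "moe" if i % moe_every == moe_every - 1 else "dense"
        schedule.append((mixer, mlp))
    return schedule


# Example values only: 8 layers, 1 attention layer per 8, MoE every 2nd layer.
schedule = build_layer_schedule(n_layers=8, attn_every=8, moe_every=2)
mixer_counts = Counter(mixer for mixer, _ in schedule)
mlp_counts = Counter(mlp for _, mlp in schedule)

for idx, (mixer, mlp) in enumerate(schedule):
    print(f"layer {idx}: {mixer:9s} + {mlp} MLP")
```

Changing `attn_every` trades attention layers against Mamba layers, and changing `moe_every` trades total parameter count against active parameters per token. This is the kind of resource- and objective-specific configuration the abstract refers to.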

Taxonomy

Topics: Language, Discourse, Communication Strategies · Multilingual Education and Policy

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Multi-Head Attention · Softmax · Dense Connections · Label Smoothing · Adam · Absolute Position Encodings