Falcon Mamba: The First Competitive Attention-free 7B Language Model

Jingwei Zuo; Maksim Velikanov; Dhia Eddine Rhaiem; Ilyas Chahed; Younes Belkada; Guillaume Kunsch; Hakim Hacid

arXiv:2410.05355·cs.CL·October 10, 2024·3 cites

TL;DR

Falcon Mamba 7B is an attention-free base language model built on the Mamba architecture and trained on 5.8 trillion tokens. It surpasses several leading open-weight Transformer models on the Open LLM Leaderboard, and it generates long sequences faster and with less memory.

Contribution

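This paper presents Falcon Mamba 7B, which the authors describe as the first competitive attention-free (pure Mamba) language model at the 7B scale. It surpasses existing Mamba and hybrid Mamba-Transformer models on the Open LLM Leaderboard, and it is faster and more memory-efficient at inference when generating long sequences.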

Findings

01 Outperforms leading open-weight Transformer models such as Mistral 7B and Llama3.1 8B.

02 Achieves top performance among Mamba models at this scale on the Open LLM Leaderboard.

03 Offers faster inference and lower memory usage when generating long sequences.
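
The memory finding can be made concrete with a back-of-the-envelope estimate. The sketch below compares how the key-value cache of an attention model grows with sequence length against the fixed-size recurrent state of a Mamba-style layer. All dimensions are illustrative assumptions chosen for the example, not values taken from the paper, and both function names are hypothetical.

```python
# Rough per-sequence inference memory: attention KV cache vs. SSM state.
# All sizes below are illustrative assumptions, not a published configuration.

MIB = 1024 ** 2


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Attention caches one key and one value vector per token, KV head, and layer."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, d_conv, bytes_per_value=2):
    """A Mamba-style layer keeps a (d_inner x d_state) recurrent state plus a
    (d_inner x d_conv) convolution buffer, independent of sequence length."""
    return n_layers * d_inner * (d_state + d_conv) * bytes_per_value


# Hypothetical 16-bit models (2 bytes per value); dimensions are assumptions.
ATTN = dict(n_layers=32, n_kv_heads=8, head_dim=128)
SSM = dict(n_layers=64, d_inner=8192, d_state=16, d_conv=4)

if __name__ == "__main__":
    for seq_len in (1_024, 8_192, 65_536):
        kv = kv_cache_bytes(seq_len, **ATTN) / MIB
        ssm = ssm_state_bytes(**SSM) / MIB
        print(f"seq_len={seq_len:>6}  KV cache={kv:9.1f} MiB  SSM state={ssm:6.1f} MiB")
```

Under these assumptions the cache grows linearly with sequence length while the state-space footprint stays constant. The absolute numbers depend entirely on the chosen dimensions, so only the scaling behaviour should be read from this sketch.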

Abstract

In this technical report, we present Falcon Mamba 7B, a new base large language model based on the novel Mamba architecture. Falcon Mamba 7B is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, Falcon Mamba 7B surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3.1 8B, and Falcon2 11B. It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B and RWKV-v6 Finch 7B/14B. Currently, Falcon Mamba 7B is the best-performing Mamba model in the literature at this scale, surpassing both existing Mamba and hybrid Mamba-Transformer models, according to the Open LLM Leaderboard. Due to its architecture, Falcon Mamba 7B is significantly faster at inference and requires substantially less memory for long sequence generation. Despite recent studies suggesting that…

Taxonomy

Topics: DNA and Biological Computing

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings