Falcon Mamba: The First Competitive Attention-free 7B Language Model
Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, Hakim Hacid

TL;DR
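Falcon Mamba 7B is an attention-free language model built on the Mamba architecture. It matches or outperforms several Transformer-based models of similar size on benchmark accuracy, offers faster inference and lower memory use on long sequences, and is reported as the best-performing pure Mamba model at this scale.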
Contribution
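This paper presents Falcon Mamba 7B, described as the first competitive large language model built on a pure Mamba architecture. According to the Open LLM Leaderboard, it surpasses several open-weight Transformer models as well as existing Mamba and hybrid Mamba-Transformer models, while being more efficient for long sequence generation.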
Findings
Outperforms leading open-weight Transformer models such as Mistral 7B, Llama3.1 8B, and Falcon2 11B, and is on par with Gemma 7B.
Achieves top performance among Mamba models on the Open LLM Leaderboard.
Offers faster inference and lower memory usage for long sequences.
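The memory claim above can be illustrated with a rough back-of-the-envelope sketch. A Transformer keeps a key-value cache that grows with every generated token, while a Mamba-style model carries a fixed-size recurrent state. All layer counts and dimensions below are illustrative assumptions chosen for round numbers, not configurations taken from the paper, and the function names are hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size for a Transformer decoder.

    Each layer stores one key and one value tensor of shape
    (seq_len, n_kv_heads, head_dim), so the total grows linearly in seq_len.
    """
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, d_conv, bytes_per_value=2):
    """Approximate recurrent-state size for a Mamba-style model.

    Each layer stores an SSM state (d_inner x d_state) and a short
    convolution buffer (d_inner x d_conv). Neither depends on seq_len.
    """
    return n_layers * d_inner * (d_state + d_conv) * bytes_per_value


# Assumed, illustrative sizes (16-bit values); not the paper's configuration.
TRANSFORMER = dict(n_layers=32, n_kv_heads=8, head_dim=128)
MAMBA = dict(n_layers=64, d_inner=8192, d_state=16, d_conv=4)

if __name__ == "__main__":
    state_mib = ssm_state_bytes(**MAMBA) / 2**20
    for seq_len in (1024, 8192, 65536):
        cache_mib = kv_cache_bytes(seq_len, **TRANSFORMER) / 2**20
        print(f"{seq_len:>6} tokens: KV cache {cache_mib:8.0f} MiB, "
              f"recurrent state {state_mib:4.0f} MiB")
```

Under these assumed sizes, the cache reaches 1 GiB at 8192 tokens and keeps growing, while the recurrent state stays at 20 MiB regardless of sequence length. Real measurements also depend on weights, activations, and implementation details that this sketch ignores.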
Abstract
In this technical report, we present Falcon Mamba 7B, a new base large language model based on the novel Mamba architecture. Falcon Mamba 7B is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, Falcon Mamba 7B surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3.1 8B, and Falcon2 11B. It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B and RWKV-v6 Finch 7B/14B. Currently, Falcon Mamba 7B is the best-performing Mamba model in the literature at this scale, surpassing both existing Mamba and hybrid Mamba-Transformer models, according to the Open LLM Leaderboard. Due to its architecture, Falcon Mamba 7B is significantly faster at inference and requires substantially less memory for long sequence generation. Despite recent studies suggesting that…
Code & Models
- 🤗 tiiuae/falcon-mamba-7b-instruct · model · 47k dl · ♡ 72
- 🤗 tiiuae/falcon-mamba-7b · model · 20k dl · ♡ 241
- 🤗 tiiuae/falcon-mamba-7b-4bit · model · 27 dl · ♡ 11
- 🤗 tiiuae/falcon-mamba-7b-instruct-4bit · model · 7 dl · ♡ 12
- 🤗 tiiuae/falcon-mamba-7b-instruct-Q8_0-GGUF · model · 23 dl · ♡ 5
- 🤗 tiiuae/falcon-mamba-7b-Q8_0-GGUF · model · 21 dl · ♡ 2
- 🤗 tiiuae/falcon-mamba-7b-F16-GGUF · model · 15 dl · ♡ 1
- 🤗 tiiuae/falcon-mamba-7b-BF16-GGUF · model · 10 dl · ♡ 2
- 🤗 tiiuae/falcon-mamba-7b-instruct-F16-GGUF · model · 6 dl · ♡ 2
- 🤗 tiiuae/falcon-mamba-7b-instruct-BF16-GGUF · model · 11 dl · ♡ 1
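As a minimal sketch of how one of the checkpoints listed above might be loaded, the snippet below uses the generic Hugging Face `transformers` auto classes. It assumes a recent `transformers` release that includes Falcon Mamba support, an installed `torch`, and enough memory for a 7B model in bfloat16. The `generate` helper and its defaults are hypothetical and are not taken from the paper.

```python
MODEL_ID = "tiiuae/falcon-mamba-7b"  # base checkpoint from the list above


def generate(prompt: str, max_new_tokens: int = 32) -> str:
    """Load the checkpoint and return a greedy continuation of `prompt`.

    Heavy imports stay inside the function so that defining it does not
    require torch or transformers to be installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Assumption: bfloat16 weights fit on the available device.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the full checkpoint on first use):
# print(generate("The Mamba architecture is"))
```

The helper reloads the model on every call, which keeps the sketch short. Repeated use would normally load the tokenizer and model once and reuse them.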
Taxonomy
Topics: DNA and Biological Computing
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
