TL;DR
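Zebra-Llama composes efficient hybrid language models from existing pre-trained Transformers by combining State Space Model (SSM) and Multi-head Latent Attention (MLA) layers. The resulting models reach Transformer-level accuracy with far fewer training tokens and a much smaller KV cache, which makes deployment more scalable.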
Contribution
The paper presents a scalable method for composing hybrid models from existing pre-trained models, reducing training cost and memory use while maintaining near-Transformer accuracy.
Findings
Achieves Transformer-level accuracy using only 7-11B training tokens.
Reduces KV cache size to under 3% of the original model's.
Outperforms comparable models in accuracy and efficiency.
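The KV cache finding can be made concrete with back-of-the-envelope arithmetic. The sketch below compares the per-token cache of a standard attention stack against a hybrid in which only the MLA layers cache anything, each storing one compressed latent vector per token. All shapes (16 layers, 8 KV heads, head dimension 64, latent width 96) are illustrative assumptions rather than values taken from the paper, and the fixed-size recurrent state of the SSM layers is ignored because it does not grow with sequence length.

```python
def attention_kv_elems_per_token(n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Cached elements per token for standard attention:
    one K and one V vector per KV head in every layer."""
    return n_layers * 2 * n_kv_heads * head_dim


def mla_kv_elems_per_token(n_mla_layers: int, latent_dim: int) -> int:
    """Cached elements per token when each MLA layer stores
    a single compressed latent vector instead of full K and V."""
    return n_mla_layers * latent_dim


# Hypothetical shapes, chosen only for illustration.
baseline = attention_kv_elems_per_token(n_layers=16, n_kv_heads=8, head_dim=64)
hybrid = mla_kv_elems_per_token(n_mla_layers=4, latent_dim=96)
ratio = hybrid / baseline

print(f"baseline={baseline} hybrid={hybrid} ratio={ratio:.2%}")
```

With these made-up shapes the hybrid caches 384 elements per token against 16,384 for the baseline, roughly 2.3%. The saving comes from two sources: fewer layers that cache at all, and a narrower cached vector in each of those layers.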
Abstract
With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for…
Code & Models
- amd/Zebra-Llama-1B-4MLA-12Mamba-DPO (model, 21 downloads)
- amd/Zebra-Llama-1B-4MLA-12Mamba-SFT (model, 115 downloads)
- amd/Zebra-Llama-1B-8MLA-8Mamba-SFT (model, 89 downloads)
- amd/Zebra-Llama-1B-8MLA-8Mamba-DPO (model, 122 downloads)
- amd/Zebra-Llama-3B-6MLA-22Mamba-SFT (model, 44 downloads)
- amd/Zebra-Llama-3B-6MLA-22Mamba-DPO (model, 25 downloads)
- amd/Zebra-Llama-8B-8MLA-24Mamba-SFT (model, 9 downloads, 7 likes)
- amd/Zebra-Llama-8B-8MLA-24Mamba-DPO (model, 10 downloads)
- amd/Zebra-Llama-3B-14MLA-14Mamba-DPO (model, 31 downloads)
- amd/Zebra-Llama-3B-14MLA-14Mamba-SFT (model, 39 downloads)
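The repository names above appear to encode the model size, the number of MLA and Mamba layers, and the post-training stage (SFT or DPO). A minimal sketch for pulling those fields out of a name follows. The pattern is inferred from the listed names only and is an assumption, not a documented naming scheme; `ZebraVariant` and `parse_variant` are hypothetical helpers introduced here for illustration.

```python
import re
from typing import NamedTuple


class ZebraVariant(NamedTuple):
    size: str          # e.g. "1B"
    mla_layers: int    # number of MLA layers
    mamba_layers: int  # number of Mamba (SSM) layers
    stage: str         # "SFT" or "DPO"


# Pattern inferred from the listed repository names (an assumption).
_NAME = re.compile(r"^amd/Zebra-Llama-(\d+B)-(\d+)MLA-(\d+)Mamba-(SFT|DPO)$")


def parse_variant(repo_id: str) -> ZebraVariant:
    """Split a repository name into its size, layer counts, and stage."""
    match = _NAME.match(repo_id)
    if match is None:
        raise ValueError(f"unrecognized repository name: {repo_id!r}")
    size, mla, mamba, stage = match.groups()
    return ZebraVariant(size, int(mla), int(mamba), stage)


variant = parse_variant("amd/Zebra-Llama-8B-8MLA-24Mamba-SFT")
print(variant, "total layers:", variant.mla_layers + variant.mamba_layers)
```

Read this way, each listed variant's layer counts sum to a fixed depth per size: 16 for the 1B models, 28 for 3B, and 32 for 8B.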
Taxonomy
Methods: Softmax · Attention Is All You Need
