Zebra-Llama: Towards Extremely Efficient Hybrid Models

Mingyu Yang; Mehdi Rezagholizadeh; Guihong Li; Vikram Appia; Emad Barsoum

arXiv:2505.17272·cs.LG·January 21, 2026

TL;DR

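Zebra-Llama composes efficient hybrid language models (1B, 3B, and 8B) from State Space Model (SSM) and Multi-head Latent Attention (MLA) layers, transferring knowledge from existing pre-trained Transformers. It reaches Transformer-level accuracy with only 7-11B training tokens and a much smaller KV cache, which makes deployment more scalable.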

Contribution

The paper presents a scalable method for composing hybrid models from existing pre-trained models, reducing training cost and memory use while maintaining near-Transformer accuracy.

Findings

01. Achieves Transformer-level accuracy with 7-11B training tokens.

02. Reduces KV cache size to under 3% of the original.

03. Outperforms comparable models in accuracy and efficiency.
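
The KV cache reduction in the findings above can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's configuration: the layer counts, latent dimension, and RoPE dimension are hypothetical values chosen only to show why a hybrid that keeps a few MLA layers (each caching one compressed latent per token) and replaces the rest with SSM layers (whose state does not grow with sequence length) ends up with a cache that is a small fraction of a standard Transformer's.

```python
def kv_cache_elems_attention(num_layers: int, num_kv_heads: int, head_dim: int,
                             seq_len: int) -> int:
    """Cached elements for standard attention: one key and one value
    vector per KV head, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len


def kv_cache_elems_mla(num_mla_layers: int, latent_dim: int, rope_dim: int,
                       seq_len: int) -> int:
    """Cached elements for MLA layers: one compressed KV latent plus one
    decoupled RoPE key per layer, per token. SSM layers are ignored here
    because their recurrent state is independent of seq_len."""
    return num_mla_layers * (latent_dim + rope_dim) * seq_len


# Hypothetical numbers, for illustration only (not taken from the paper).
SEQ_LEN = 8192
baseline = kv_cache_elems_attention(num_layers=32, num_kv_heads=8,
                                    head_dim=128, seq_len=SEQ_LEN)
hybrid = kv_cache_elems_mla(num_mla_layers=8, latent_dim=128,
                            rope_dim=32, seq_len=SEQ_LEN)

ratio = hybrid / baseline
print(f"hybrid cache is {ratio:.2%} of the baseline")  # prints 1.95% for these assumed values
```

Because both counts scale linearly with sequence length, the ratio is the same at any context length; only the number of attention-style layers and the per-token width of what each one caches matter.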

Abstract

With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for…

Taxonomy

Methods: Softmax · Attention Is All You Need