Hymba: A Hybrid-head Architecture for Small Language Models
Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov

TL;DR
Hymba is a hybrid-head architecture for small language models that runs transformer attention heads and state space model (SSM) heads in parallel. Learnable meta tokens, cross-layer KV sharing, and partial sliding window attention improve accuracy while keeping the cache compact.
Contribution
The paper presents a hybrid-head architecture with learnable meta tokens, cross-layer key-value (KV) sharing, and partial sliding window attention, improving both the accuracy and the efficiency of small LMs.
Findings
Hymba-1.5B-Base outperforms all public sub-2B models.
It achieves 1.32% higher average accuracy than Llama-3.2-3B.
Relative to Llama-3.2-3B, it reduces cache size by 11.67x and increases throughput by 3.49x.
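The abstract attributes the compact cache to cross-layer KV sharing and partial sliding window attention. The following is a back-of-the-envelope sketch of how those two mechanisms could shrink a KV cache. The function `kv_cache_bytes`, its parameters, and every number in the example are hypothetical illustrations, not Hymba's actual configuration, and the result is not meant to reproduce the 11.67x figure.

```python
import math

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, window=None,
                   n_global_layers=0, share_group=1):
    """Estimate the KV cache size in bytes for a single sequence.

    Assumptions (illustrative only):
    - `n_global_layers` layers keep keys/values for the full sequence.
    - The remaining "local" layers keep only the last `window` tokens.
    - Every `share_group` consecutive local layers share one KV cache.
    """
    n_local = n_layers - n_global_layers
    local_len = seq_len if window is None else min(seq_len, window)
    # Number of distinct caches held by the local layers after sharing.
    local_caches = math.ceil(n_local / share_group)
    # Keys and values: 2 tensors per token per KV head.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value
    cached_tokens = n_global_layers * seq_len + local_caches * local_len
    return per_token * cached_tokens

# Hypothetical model: 32 layers, 8 KV heads of width 64, fp16, 8192 tokens.
baseline = kv_cache_bytes(32, 8, 64, 8192)
compact = kv_cache_bytes(32, 8, 64, 8192,
                         window=1024, n_global_layers=3, share_group=2)
print(f"full attention: {baseline} bytes")
print(f"windowed + shared: {compact} bytes")
print(f"reduction: {baseline / compact:.2f}x")
```

The sketch only counts attention KV entries. An SSM head keeps a fixed-size recurrent state instead, which is not modeled here.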
Abstract
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our…
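The ideas in the abstract can be made concrete with a minimal, dependency-free sketch: meta tokens are prepended to the prompt, an attention head and an SSM head read the same input in parallel, and the attention head uses a sliding window that always keeps the meta tokens visible. Everything here is an assumption made for illustration: the function names, the use of raw inputs as queries, keys, and values (no learned projections), the exponential-decay recurrence standing in for a real SSM, and the plain averaging of the two head outputs. It is not the authors' implementation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def visible(i, j, n_meta, window):
    """Causal mask: token i may attend to token j if j is not in the future
    and j is either a meta token or inside the sliding window."""
    if j > i:
        return False
    return j < n_meta or window is None or i - j < window

def attention_head(xs, n_meta, window):
    """Toy attention head with q = k = v = x (no learned projections)."""
    dim = len(xs[0])
    out = []
    for i, q in enumerate(xs):
        idx = [j for j in range(len(xs)) if visible(i, j, n_meta, window)]
        scores = [sum(a * b for a, b in zip(q, xs[j])) / math.sqrt(dim)
                  for j in idx]
        weights = softmax(scores)
        out.append([sum(w * xs[j][d] for w, j in zip(weights, idx))
                    for d in range(dim)])
    return out

def ssm_head(xs, decay=0.9):
    """Toy stand-in for an SSM head: a fixed-size state that summarizes
    the past via h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    state = [0.0] * len(xs[0])
    out = []
    for x in xs:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        out.append(list(state))
    return out

def hybrid_layer(prompt, meta_tokens, window=2):
    """Run both heads in parallel on [meta tokens + prompt], then fuse."""
    xs = meta_tokens + prompt
    attn = attention_head(xs, len(meta_tokens), window)
    ssm = ssm_head(xs)
    # Assumed fusion: plain average of the two head outputs.
    fused = [[0.5 * (a + s) for a, s in zip(ra, rs)]
             for ra, rs in zip(attn, ssm)]
    return fused[len(meta_tokens):]  # drop meta positions from the output

# Hypothetical inputs: one 2-d meta token and a four-token prompt.
meta = [[1.0, 0.0]]
prompt = [[0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.2, 0.3]]
for row in hybrid_layer(prompt, meta, window=2):
    print(row)
```

In this sketch the last prompt token attends to the meta token and to the two most recent tokens only, while the SSM state still carries a summary of everything before the window.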
Peer Reviews
Decision: ICLR 2025 Spotlight
1. This is a solid and well-written paper. 2. The hybrid-head design, which combines attention and state space models, is an innovative approach that provides Hymba with both fine-grained recall and efficient long-range summarization. The introduction of learnable meta tokens as a dynamic cache initialization mechanism is also novel, drawing a parallel to human metamemory functions. 3. The experiments are extensive and well-documented, including ablation studies that thoroughly evaluate the impa
1. It would be even better if the effectiveness of Hymba could be validated on image or speech modalities.
1. The empirical study is comprehensive and clear, particularly the ablation studies. 2. Consistent gains for models of different sizes under 1.5B. 3. The paper is easy to follow.
In general I think this is a strong paper. I have the following comments and questions. - Some implementation details can be added. 1. I am a bit lost while reading the cache optimization and meta-token sections; maybe they are worth more explanation or pseudocode? - How many meta tokens are needed, and how are they related to performance on downstream tasks? - The ratio between SSM and Attention is not clear. And I understand that recent papers demonstrate that it is important to integrate attention for
1. The paper introduces several useful designs that could be empirically used for training hybrid Transformer + SSM models. 2. The proposed method shows high performance, outperforming most small LMs while achieving better computation efficiency. 3. The authors perform extensive evaluations across diverse tasks and setups.
1. Limited novelty. The paper seems to suggest a combination of implementation tricks rather than proposing a significant idea. 2. The hybrid head design seems to be the most significant component of the proposed method, but the evaluation justifying its efficacy is confusing. In Figure 3, the authors compare their method with Samba and claim they achieved a larger ERF. However, it is unclear if the gain comes from the parallel design or the introduction of global attention heads (not present in
Code & Models
- 🤗 nvidia/Hymba-1.5B-Instruct · model · 559 dl · ♡ 245
- 🤗 nvidia/Hymba-1.5B-Base · model · 572 dl · ♡ 156
- 🤗 iamashishsharma8/HybridSLM · model · 1 dl
- 🤗 himanshu-skid19/Hymba-1.5B-modified · model · 3 dl
- 🤗 himanshu-skid19/Hymba-1.5B-debug · model · 2 dl
- 🤗 himanshu-skid19/Hymba-1.5B-MAG · model · 2 dl
- 🤗 himanshu-skid19/Hymba-1.5B-Custom · model · 3 dl
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
