Hymba: A Hybrid-head Architecture for Small Language Models

Xin Dong; Yonggan Fu; Shizhe Diao; Wonmin Byeon; Zijia Chen; Ameya Sunil Mahabaleshwarkar; Shih-Yang Liu; Matthijs Van Keirsbilck; Min-Hung Chen; Yoshi Suhara; Yingyan Lin; Jan Kautz; Pavlo Molchanov

arXiv:2411.13676·cs.CL·November 22, 2024·2 cites

TL;DR

Hymba is a family of small language models whose hybrid-head architecture runs transformer attention heads and state space model (SSM) heads in parallel. Learnable meta tokens, cross-layer KV sharing, and partial sliding window attention keep the cache compact while improving accuracy.

Contribution

The paper presents a hybrid-head architecture with learnable meta tokens and cache optimizations (cross-layer KV sharing and partial sliding window attention) that improve both the performance and the efficiency of small LMs.

Findings

01

Hymba-1.5B-Base outperforms all public sub-2B models.

02

Achieves 1.32% higher average accuracy than Llama-3.2-3B.

03

Reduces cache size by 11.67x and increases throughput by 3.49x relative to Llama-3.2-3B.
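The abstract attributes the compact cache to cross-layer KV sharing and partial sliding window attention. As a rough illustration of why those two ideas shrink the cache, the sketch below estimates KV cache bytes for a single sequence. The function `kv_cache_bytes`, its parameters, and every configuration number are hypothetical and not taken from the paper, so the printed ratio is illustrative only and is not expected to match the reported 11.67x.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   num_global_layers=None, window=None,
                   share_group=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence.

    Assumptions of this sketch (not the paper's exact scheme):
    - global layers cache every token; the remaining "local" layers
      cache at most `window` tokens (sliding window attention);
    - every `share_group` consecutive local layers reuse one KV cache.
    """
    if num_global_layers is None:
        num_global_layers = num_layers  # plain full attention everywhere
    num_local_layers = num_layers - num_global_layers

    # Keys and values, per token, per layer.
    per_token = 2 * num_kv_heads * head_dim * bytes_per_elem

    local_tokens = seq_len if window is None else min(seq_len, window)
    # Ceiling division: number of distinct caches held by local layers.
    local_caches = -(-num_local_layers // share_group)

    cached_tokens = num_global_layers * seq_len + local_caches * local_tokens
    return per_token * cached_tokens


# Hypothetical configuration, chosen only for illustration.
full = kv_cache_bytes(seq_len=8192, num_layers=32, num_kv_heads=8, head_dim=64)
compact = kv_cache_bytes(seq_len=8192, num_layers=32, num_kv_heads=8, head_dim=64,
                         num_global_layers=4, window=1024, share_group=2)
print(f"full: {full} bytes, compact: {compact} bytes, ratio: {full / compact:.2f}x")
```

The saving grows with sequence length, because only the global layers keep a cache that scales with the full context.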

Abstract

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our…
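To make the hybrid-head idea concrete, here is a minimal NumPy sketch of one layer in which an attention branch and an SSM-like branch read the same input in parallel, with meta tokens prepended to the sequence. It is a sketch under stated assumptions, not the authors' implementation: the SSM branch is a plain diagonal linear recurrence standing in for a real SSM, the fusion rule (averaging the normalized branch outputs) is an assumption of this example, and all names (`hybrid_head_layer`, `params`, `meta_tokens`) are hypothetical.

```python
import numpy as np


def causal_attention(x, wq, wk, wv):
    """Single-head causal softmax attention over a (tokens, dim) array."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_recurrence(x, decay, w_in, w_out):
    """Stand-in for an SSM head: h_t = decay * h_{t-1} + u_t, y_t = h_t @ w_out."""
    u = x @ w_in
    h = np.zeros(u.shape[-1])
    states = []
    for u_t in u:
        h = decay * h + u_t  # fixed-size state summarizes the whole prefix
        states.append(h)
    return np.stack(states) @ w_out


def layer_norm(y, eps=1e-5):
    mean = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mean) / np.sqrt(var + eps)


def hybrid_head_layer(x, meta_tokens, params):
    # Prepend the meta tokens so both branches can use them.
    seq = np.concatenate([meta_tokens, x], axis=0)
    attn_out = causal_attention(seq, params["wq"], params["wk"], params["wv"])
    ssm_out = linear_recurrence(seq, params["decay"], params["w_in"], params["w_out"])
    # Assumed fusion rule: average the normalized outputs of the two branches.
    fused = 0.5 * (layer_norm(attn_out) + layer_norm(ssm_out))
    return fused[len(meta_tokens):]  # drop the meta-token positions


rng = np.random.default_rng(0)
dim, num_meta, num_tokens = 8, 4, 6
params = {name: rng.normal(size=(dim, dim)) / np.sqrt(dim)
          for name in ("wq", "wk", "wv", "w_in", "w_out")}
params["decay"] = rng.uniform(0.5, 0.99, size=dim)
meta_tokens = rng.normal(size=(num_meta, dim))  # would be learned in training
x = rng.normal(size=(num_tokens, dim))
out = hybrid_head_layer(x, meta_tokens, params)
print(out.shape)
```

Because the meta tokens sit at the start of the sequence, every real token can attend to them, which gives attention somewhere to place weight other than the prompt itself.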

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. This is a solid and well-written paper.
2. The hybrid-head design, which combines attention and state space models, is an innovative approach that provides Hymba with both fine-grained recall and efficient long-range summarization. The introduction of learnable meta tokens as a dynamic cache initialization mechanism is also novel, drawing a parallel to human metamemory functions.
3. The experiments are extensive and well-documented, including ablation studies that thoroughly evaluate the impa…

Weaknesses

1. It would be even better if the effectiveness of Hymba could be validated on image or speech modalities.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The empirical study is comprehensive and clear, particularly the ablation studies.
2. Consistent gains for models of different sizes under 1.5B.
3. The paper is easy to follow.

Weaknesses

In general, I think this is a strong paper. I have the following comments and questions.
- Some implementation details could be added. I was a bit lost while reading about the cache optimization and the meta tokens; these may be worth more explanation or pseudocode.
- How many meta tokens are needed, and how is their number related to performance on downstream tasks?
- The ratio between SSM and attention is not clear. I understand that recent papers demonstrate that it is important to integrate attention for…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper introduces several useful designs that could be empirically used for training hybrid Transformer + SSM models.
2. The proposed method shows high performance, outperforming most small LMs while achieving better computation efficiency.
3. The authors perform extensive evaluations across diverse tasks and setups.

Weaknesses

1. Limited novelty. The paper seems to suggest a combination of implementation tricks rather than proposing a significant idea.
2. The hybrid head design seems to be the most significant component of the proposed method, but the evaluation justifying its efficacy is confusing. In Figure 3, the authors compare their method with Samba and claim they achieved a larger ERF. However, it is unclear if the gain comes from the parallel design or the introduction of global attention heads (not present in…

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need