DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
Wei He, Kai Han, Yehui Tang, Chengcheng Wang, Yujie Yang, Tianyu Guo, Yunhe Wang

TL;DR
DenseMamba introduces DenseSSM, a novel architecture that enhances state space models with dense hidden connections, significantly improving performance of large language models while maintaining efficiency.
Contribution
The paper proposes DenseSSM, a new method that improves state space models by adding dense hidden connections, leading to better performance without sacrificing efficiency.
Findings
DenseRetNet outperforms the original RetNet by up to 5% in accuracy.
DenseSSM maintains training parallelizability and inference efficiency.
Applicable to various SSM types like RetNet and Mamba.
Abstract
Large language models (LLMs) face a daunting challenge due to the excessive computational and memory requirements of the commonly used Transformer architecture. While state space models (SSMs) are a new type of foundational network architecture offering lower computational complexity, their performance has yet to fully rival that of Transformers. This paper introduces DenseSSM, a novel approach to enhance the flow of hidden information between layers in SSMs. By selectively integrating shallow-layer hidden states into deeper layers, DenseSSM retains fine-grained information crucial for the final output. Even with the added dense connections, DenseSSM maintains training parallelizability and inference efficiency. The proposed method is widely applicable to various SSM types such as RetNet and Mamba. With similar model size, DenseSSM achieves significant improvements, exemplified by…
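The idea of feeding shallow-layer hidden states into deeper layers can be illustrated with a toy sketch. This is not the paper's implementation: the diagonal recurrence, the sigmoid gate conditioned on the current state, the window size `m`, and every function name below (`ssm_layer_states`, `dense_fuse`, `dense_ssm_stack`) are assumptions made purely for illustration, since the abstract only states that shallow-layer states are selectively integrated.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ssm_layer_states(xs, decay, b):
    """Toy diagonal recurrence per channel: h_t = decay * h_{t-1} + b * x_t."""
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [decay * hi + b * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states


def dense_fuse(current, shallow_list, gate_w):
    """Add gated shallow-layer states to the current layer's states.

    The gate form sigmoid(gate_w * h_current) is an assumption; it stands in
    for whatever selection mechanism the paper actually uses.
    """
    fused = [list(h) for h in current]
    for shallow in shallow_list:
        for t, (h_cur, h_sh) in enumerate(zip(current, shallow)):
            for i in range(len(h_cur)):
                gate = sigmoid(gate_w * h_cur[i])
                fused[t][i] += gate * h_sh[i]
    return fused


def dense_ssm_stack(xs, num_layers=4, m=2, decay=0.9, b=0.5, gate_w=1.0):
    """Stack toy SSM layers; each layer also sees states of the m layers below."""
    history = []  # hidden-state sequences of earlier layers (before fusion)
    inputs = xs
    for _ in range(num_layers):
        states = ssm_layer_states(inputs, decay, b)
        shallow = history[-m:] if m > 0 else []
        fused = dense_fuse(states, shallow, gate_w)
        history.append(states)
        inputs = fused  # identity readout: fused state is the next layer's input
    return inputs


if __name__ == "__main__":
    xs = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]  # 3 time steps, 2 channels
    print(dense_ssm_stack(xs, num_layers=3, m=2))
```

In this sketch the fusion is applied element-wise at each time step, so it adds no new dependency across time beyond the recurrence itself, which is consistent with the abstract's claim that parallel training and efficient inference are preserved. Setting `m=0` recovers a plain stack of layers for comparison.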
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Dropout · Multi-Head Attention · Softmax · Dense Connections · Label Smoothing · Adam
