TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding
Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, and Yuyu Luo

TL;DR
TransXSSM introduces a unified rotary position embedding to effectively combine Transformers and State Space Models, resulting in faster training and inference, and improved accuracy on language modeling tasks.
Contribution
The paper proposes a novel unified rotary position embedding scheme that enables seamless integration of Transformer and SSM layers, improving performance and scalability.
Findings
TransXSSM trains 42.3% faster.
It surpasses Transformer baselines by over 4% in accuracy.
Scaling to 1.3B parameters yields 7.22% higher accuracy than smaller models.
Abstract
Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this Unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently…
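To make the positional-encoding idea concrete, here is a minimal sketch of standard rotary position embedding in plain Python. It is not the authors' implementation. The helper names (`rope`, `dot`, `relative_score`), the base of 10000, and the toy vectors are all illustrative assumptions. The final lines apply the same rotation to a hypothetical pair of SSM projection vectors, which is one plausible reading of "a consistent positional encoding framework for both self-attention and state-space components"; the abstract as shown here does not spell out the mechanism.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by pos * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def relative_score(left, right, pos_left, pos_right):
    """Inner product after rotating each vector by its own position."""
    return dot(rope(left, pos_left), rope(right, pos_right))


# Attention-style pair: toy query and key vectors (made-up values).
q = [0.1, -0.4, 0.7, 0.2]
k = [0.5, 0.3, -0.2, 0.9]
attn_near = relative_score(q, k, 5, 3)
attn_far = relative_score(q, k, 12, 10)  # same offset of 2, shifted positions

# Hypothetical SSM-style pair (assumption): the same rotation applied to
# output- and input-projection vectors, here called c_vec and b_vec.
c_vec = [0.3, 0.3, -0.6, 0.1]
b_vec = [-0.2, 0.8, 0.4, 0.4]
ssm_near = relative_score(c_vec, b_vec, 5, 3)
ssm_far = relative_score(c_vec, b_vec, 12, 10)
```

Because each pair is rotated by an angle proportional to its position, the inner product of two rotated vectors depends only on the difference between their positions. In the sketch, `attn_near` and `attn_far` therefore agree up to floating-point error, and so do `ssm_near` and `ssm_far`. Sharing a single rotation function across both layer types is what gives them the same relative-position behavior.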
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Multimodal Machine Learning Applications
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer
