TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, and Yuyu Luo

arXiv:2506.09507 · cs.CL · June 19, 2025

Open Access

TL;DR

TransXSSM introduces a unified rotary position embedding to effectively combine Transformers and State Space Models, resulting in faster training and inference, and improved accuracy on language modeling tasks.

Contribution

The paper proposes a novel unified rotary position embedding scheme that enables seamless integration of Transformer and SSM layers, improving performance and scalability.

Findings

01

TransXSSM trains 42.3% faster than standard Transformer models.

02

It surpasses Transformer baselines by over 4% in accuracy.

03

Scaling TransXSSM to 1.3B parameters yields 7.22% higher accuracy than its smaller variant.

Abstract

Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this Unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently…
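For readers unfamiliar with rotary embeddings, the sketch below illustrates the standard RoPE rotation that the abstract refers to, and the property that makes it attractive as a shared encoding: the dot product of two rotated vectors depends only on their relative offset. This is a minimal illustration of ordinary RoPE, not the paper's Unified RoPE formulation; the function names `rope` and `dot`, the base of 10000, and the toy vectors are assumptions chosen for the example.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a standard rotary position embedding to `vec` at position `pos`.

    Consecutive pairs (vec[i], vec[i+1]) are rotated by the angle
    pos * base**(-i / d), where d is the (even) vector dimension.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query-like and key-like vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions are separated by the same offset of 3.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

# The two scores agree up to floating-point error.
print(abs(score_a - score_b) < 1e-9)
```

Because the score depends only on the offset between positions, the same rotation could in principle be applied to any pair of vectors whose interaction is a dot product, which is the kind of consistency across layer types that the abstract describes.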

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Multimodal Machine Learning Applications

Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer