Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen

TL;DR
This paper introduces SambaY, a decoder-hybrid-decoder architecture that shares memory readout states across layers through Gated Memory Units, improving decoding efficiency and long-context reasoning performance over the YOCO baseline.
Contribution
The paper proposes the Gated Memory Unit (GMU) and the SambaY architecture, enabling efficient memory sharing across layers and improved long-context reasoning without explicit positional encoding.
Findings
SambaY achieves up to 10x higher decoding throughput on long prompts.
SambaY exhibits lower irreducible loss and better scalability than the YOCO baseline.
Significant performance improvements on reasoning benchmarks like Math500 and AIME24/25.
Abstract
Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through…
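The memory-sharing idea in the abstract can be illustrated with a minimal sketch. The abstract only states that a GMU in the cross-decoder reuses memory readout states from the self-decoder; the specific form used here (an element-wise SiLU gate computed from the current hidden state, followed by an output projection) is an assumption for illustration, as are the names `gated_memory_unit`, `W_gate`, and `W_out`.

```python
import math

def silu(z):
    # SiLU activation: z * sigmoid(z). Choice of activation is an assumption.
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_memory_unit(x, memory, W_gate, W_out):
    """Hypothetical GMU sketch (not the authors' reference code).

    x      : hidden state of the current cross-decoder layer
    memory : memory readout state shared from an earlier self-decoder layer
    W_gate : projects x to the memory dimension to form the gate
    W_out  : projects the gated memory back to the model dimension
    """
    gate = [silu(g) for g in matvec(W_gate, x)]       # gate derived from current input
    gated = [m * g for m, g in zip(memory, gate)]     # element-wise gating of shared memory
    return matvec(W_out, gated)                       # output projection

# Usage: the same `memory` list could be passed to several layers,
# so the earlier layer's readout is computed once and reused.
identity = [[1.0, 0.0], [0.0, 1.0]]
y = gated_memory_unit([1.0, 0.0], [2.0, 3.0], identity, identity)
```

Under these assumptions, each such layer costs only two projections and an element-wise product per token, with no attention over past positions, which is one plausible reading of why sharing memory this way helps decoding efficiency.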
Peer Reviews
Decision: NeurIPS 2025 poster
**Strengths**
- The experiments are extensive and include parameter and data scaling studies for the considered architectures, with carefully scaled hyperparameters.
- The proof-of-concept experiments on proprietary data are quite convincing. The gains of Phi4-mini-Flash, which incorporates the proposed GMU as well as the pre-existing Differential Attention (DA) and other improvements, over Phi4-mini are substantial. Throughput gains over Phi4-mini-Reasoning in reasoning tasks are also noteworthy.

**Weaknesses**
- I
**Strengths**
1. The architecture introduces the GMU, and the paper shows how the modifications help with modeling.
2. The paper thoroughly explores all axes (scaling model size, dataset size, pre-training hyperparameters, long-context retrieval, downstream evaluation).
3. The different architecture ablations are useful for understanding the design choices of the model.
4. The paper also shows inference scaling (through random weights in vLLM) to demonstrate the linear-time complexity behavior it describes.

**Weakness
Strengths:
- Extensive and large-scale experiments on scaling behavior and good scaling performance
- The scaling setup is explained in detail and the relations between width, depth, and hyperparameters are given as formulas.
- Detailed ablation study on long context performance
- Superior performance compared to Transformer++ baseline on Long-Context Benchmarks

Weaknesses:
- In the ablation study in Table 5, the Differential Attention used in the final Phi4-mini-Flash model is missing
- The au
Taxonomy
Topics: Neural Networks and Applications
