DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages

Alish Kanani; Sangwan Lee; Han Lyu; Jiahao Lin; Jaehyun Park; Umit Y. Ogras

arXiv:2603.15530 · cs.AR · March 17, 2026

Open Access

TL;DR

DUET is a disaggregated accelerator for hybrid Mamba-Transformer large language models. It runs the prefill and decode phases on separate, specialized packages to improve time to first token, throughput, and time between tokens.

Contribution

The paper introduces a disaggregated architecture that assigns the prefill and decode phases to separate, specialized hardware packages, targeting the bottlenecks that hybrid models hit on homogeneous, matmul-centric accelerators.

Findings

01. 4x faster time to first token
02. 1.4x higher throughput
03. 1.5x lower time between tokens

Abstract

Large language models operate in two distinct phases: a compute-bound prefill followed by a memory-bandwidth-bound decode. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns prefill and decode phases to specialized packages. The Prefill package uses systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package uses vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with…
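
The prefill/decode asymmetry described in the abstract can be illustrated with a back-of-the-envelope arithmetic-intensity estimate. The sketch below is not taken from the paper: the function name, the model width `d_model`, the prompt length, and the 2-byte element size are all hypothetical, and it assumes the idealized case where each operand and the output cross the memory interface exactly once.

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matrix multiply.

    Idealized model (an assumption): both inputs and the output are
    moved to or from memory exactly once, with no cache reuse effects.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical sizes chosen only for illustration.
d_model = 4096     # width of a square weight matrix
prompt_len = 2048  # tokens processed together during prefill

# Prefill: the whole prompt multiplies the weight matrix at once.
prefill_ai = gemm_arithmetic_intensity(prompt_len, d_model, d_model)

# Decode: a single new token multiplies the same weight matrix.
decode_ai = gemm_arithmetic_intensity(1, d_model, d_model)

print(f"prefill: {prefill_ai:.1f} FLOPs/byte")
print(f"decode:  {decode_ai:.4f} FLOPs/byte")
```

With these illustrative sizes, the prefill multiply performs about 1024 FLOPs per byte moved, while the single-token decode multiply performs just under 1. Under this simple model, prefill has enough work per byte to keep a matrix engine busy, whereas decode is limited by how fast the weights can be read, which is consistent with the abstract's pairing of systolic arrays with prefill and high-bandwidth in-package memory with decode.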

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Natural Language Processing Techniques