Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
StepFun: Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li

TL;DR
This paper presents Step-3, a cost-effective large language model system that employs novel model-system co-design techniques, significantly reducing decoding costs and increasing throughput for long-context tasks.
Contribution
Introduces Step-3, a 321B-parameter model that combines Multi-Matrix Factorization Attention (MFA) with Attention-FFN Disaggregation (AFD) for hardware-efficient, low-cost decoding.
Findings
Decodes at up to 4,039 tokens per second per GPU, a higher throughput than DeepSeek-V3.
Reduces theoretical decoding costs relative to models such as DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer contexts.
Achieves a new Pareto frontier for LLM decoding efficiency.
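To make the link between throughput and cost concrete, here is a rough back-of-the-envelope sketch. It only converts a per-GPU decoding throughput into a cost per million tokens; the GPU hourly price is a hypothetical placeholder, and the paper's own cost analysis is theoretical and accounts for more factors than this.

```python
def cost_per_million_tokens(tokens_per_sec_per_gpu: float,
                            gpu_hour_price_usd: float) -> float:
    """Convert per-GPU decoding throughput into USD per one million tokens."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_price_usd / tokens_per_hour * 1_000_000


# Assumption: a made-up rental price, NOT a figure from the paper.
HYPOTHETICAL_GPU_HOUR_PRICE_USD = 2.0

# 4,039 tokens/sec per GPU is the throughput quoted in the findings above.
step3_cost = cost_per_million_tokens(4039, HYPOTHETICAL_GPU_HOUR_PRICE_USD)
print(f"Approx. USD per 1M decoded tokens: {step3_cost:.4f}")
```

At a fixed GPU price, cost per token is inversely proportional to throughput, so doubling tokens per second per GPU halves the cost.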
Abstract
Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B…
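The abstract's point about KV cache size can be illustrated with generic arithmetic. The sketch below is not the MFA design; it uses the standard per-token KV cache formula for multi-head attention and shows how storing fewer key/value heads shrinks the cache. Every configuration value (layer count, head counts, head dimension, precision) is a hypothetical placeholder, not a Step-3 parameter.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache stored per token (factor 2 = one key + one value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_len: int, batch_size: int, bytes_per_token: int) -> float:
    """Total KV cache in GiB for a batch of sequences of equal length."""
    return context_len * batch_size * bytes_per_token / 2**30


# Hypothetical configurations, for illustration only.
full_heads = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=64,
                                      head_dim=128, bytes_per_value=1)
shared_head = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=1,
                                       head_dim=128, bytes_per_value=1)

print("bytes/token, 64 KV heads:", full_heads)
print("bytes/token, 1 shared KV head:", shared_head)
print("GiB at 32K context, batch 8, 64 KV heads:",
      kv_cache_gib(32 * 1024, 8, full_heads))
```

Because the cache grows linearly with context length, any reduction in bytes per token matters more as contexts get longer, which is consistent with the abstract's claim that the gains widen at longer context.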
Code & Models
- 🤗 stepfun-ai/Step-3.5-Flash (model) · 93k downloads · ♡ 754
- 🤗 stepfun-ai/Step-3.5-Flash-FP8 (model) · 298k downloads · ♡ 51
- 🤗 stepfun-ai/step3 (model) · 103k downloads · ♡ 166
- 🤗 stepfun-ai/step3-fp8 (model) · 29 downloads · ♡ 20
- 🤗 stepfun-ai/Step-3.5-Flash-GGUF-Q4_K_S (model) · 15k downloads · ♡ 140
- 🤗 void-818/Affine-luca_v9-5CtFSMCbvHryns4E7YrACNDyFYAcxGU9SkokGPHiJuvPNUci (model) · 25 downloads
- 🤗 milkowski/Step-3.5-Flash-GGUF (model) · 6 downloads · ♡ 2
- 🤗 stepfun-ai/Step-3.5-Flash-GGUF-Q8_0 (model) · 232 downloads · ♡ 3
- 🤗 cyankiwi/Step-3.5-Flash-AWQ-4bit (model) · 209 downloads
- 🤗 Tawheeb123/Step-3.5-Flash (model) · 16 downloads
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
