ZAYA1-8B Technical Report
Robert Washbourne, Rishi Iyer, Tomas Figliolia, Henry Zheng, Ryan Lorig-Roach, Sungyeon Yang, Pritish Yuvraj, Quentin Anthony, Yury Tokpanov, Xiao Yang, Ganesh Nanduru, Stephen Ebert, Praneeth Medepalli, Skyler Szot, Srivatsan Rajagopal, Alex Ong, Bhavana Mehta, Beren Millidge

TL;DR
ZAYA1-8B is a mixture-of-experts model with 700M active and 8B total parameters, optimized for reasoning tasks. It achieves competitive performance on mathematics and coding benchmarks through its training recipe and test-time compute techniques.
Contribution
The paper introduces ZAYA1-8B, a reasoning-focused MoE model with novel training, fine-tuning, and test-time aggregation methods, including Markovian RSA, to enhance reasoning performance.
Findings
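ZAYA1-8B matches or exceeds DeepSeek-R1-0528 on several challenging mathematics and coding benchmarks, and remains competitive with substantially larger open-weight reasoning models.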
Markovian RSA improves reasoning trace aggregation, boosting test performance.
ZAYA1-8B achieves 91.9% on AIME'25 and 89.6% on HMMT'25.
Abstract
We present ZAYA1-8B, a reasoning-focused mixture-of-experts (MoE) model with 700M active and 8B total parameters, built on Zyphra's MoE++ architecture. ZAYA1-8B's core pretraining, midtraining, and supervised fine-tuning (SFT) were performed on a full-stack AMD compute, networking, and software platform. With under 1B active parameters, ZAYA1-8B matches or exceeds DeepSeek-R1-0528 on several challenging mathematics and coding benchmarks, and remains competitive with substantially larger open-weight reasoning models. ZAYA1-8B was trained from scratch for reasoning, with reasoning data included from pretraining onward using an answer-preserving trimming scheme. Post-training uses a four-stage RL cascade: reasoning warmup on math and puzzles; a 400-task RLVE-Gym curriculum; math and code RL with test-time compute traces and synthetic code environments built from competitive-programming…
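The distinction between active and total parameters in the abstract can be illustrated with a small arithmetic sketch. This is not ZAYA1-8B's actual configuration: the `MoEConfig` fields and the toy numbers below are hypothetical, and the sketch assumes a plain top-k routed MoE in which each token uses the shared weights plus a fixed number of experts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical MoE size description (not the ZAYA1-8B configuration)."""
    shared_params: int       # weights every token uses (attention, embeddings, router)
    params_per_expert: int   # weights in one expert, summed over all layers
    num_experts: int         # experts stored in the model
    experts_per_token: int   # experts selected per token (top-k routing)


def total_params(cfg: MoEConfig) -> int:
    """All weights that must be stored: shared weights plus every expert."""
    return cfg.shared_params + cfg.num_experts * cfg.params_per_expert


def active_params(cfg: MoEConfig) -> int:
    """Weights actually used for one token: shared weights plus the routed experts."""
    return cfg.shared_params + cfg.experts_per_token * cfg.params_per_expert


# Toy numbers chosen only to make the arithmetic easy to follow.
toy = MoEConfig(shared_params=100, params_per_expert=50,
                num_experts=16, experts_per_token=2)

print(total_params(toy))   # 100 + 16 * 50 = 900
print(active_params(toy))  # 100 + 2 * 50 = 200
```

Under these assumptions, per-token compute scales with the active count while memory footprint scales with the total count, which is why a model can have 8B total parameters but under 1B active.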
